用R语言做数据分析:回归分析(二)

继续一些回归分析的例子,这里我们将事先构造一些已知的数据,然后利用回归算法来看看结果。
二次函数回归
第一个例子是一个二次函数,每个y值加上一个均值为0的随机扰动。然后用来做回归分析。

> x=seq(1,30)
> y=x*x + x + 2
> ep=rnorm(30)
> y=y+ep
> dd=data.frame(x=x,y=y)
> mod=lm(y~x+I(x*x),data=dd)
> summary(mod)

Call:
lm(formula = y ~ x + I(x * x), data = dd)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.48335 -0.41626  0.01971  0.37353  1.70751 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.283228   0.478407   4.773 5.61e-05 ***
x           0.957549   0.071140  13.460 1.72e-13 ***
I(x * x)    1.001245   0.002227 449.657  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8159 on 27 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.829e+06 on 2 and 27 DF,  p-value: < 2.2e-16

可以看到这个R^2基本上等于1了。给出的结果和期望的一致。
多元线性回归
下面再看一个多元回归的例子,这个例子中,给定4个x , 并且用给定的方程生成y,并且在y上加上均值为0的随机扰动,看看回归的结果如何。

> x1=rnorm(30)
> x2=rnorm(30)
> x3=rnorm(30)
> x4=rnorm(30)
> y=x1+3*x2-5*x3+3.998*x4
> ep=rnorm(30)
> y=y+ep
> dd=data.frame(x1=x1,x2=x2,x3=x3,x4=x4,y=y)
> mod=lm(y~x1+x2+x3+x4,data=dd)
> summary(mod)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = dd)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.00943 -0.47345  0.07849  0.44839  2.10942 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1670     0.2115   0.790    0.437    
x1            1.2142     0.2343   5.182 2.33e-05 ***
x2            3.6530     0.2466  14.814 6.95e-14 ***
x3           -4.8926     0.1656 -29.536  < 2e-16 ***
x4            3.9775     0.2205  18.039 7.60e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.056 on 25 degrees of freedom
Multiple R-squared:  0.9789,    Adjusted R-squared:  0.9755 
F-statistic: 289.6 on 4 and 25 DF,  p-value: < 2.2e-16

如果我们把这个随机的扰动去掉,算出来的结果和我们给出的系数可以说是完全一致

> y=y-ep
> dd=data.frame(x1=x1,x2=x2,x3=x3,x4=x4,y=y)
> mod2=lm(y~x1+x2+x3+x4-1,data=dd)
> summary(mod2)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 - 1, data = dd)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.524e-15 -5.286e-16  2.387e-16  1.028e-15  2.730e-15 

Coefficients:
     Estimate Std. Error    t value Pr(>|t|)    
x1  1.000e+00  2.985e-16  3.351e+15   <2e-16 ***
x2  3.000e+00  3.038e-16  9.874e+15   <2e-16 ***
x3 -5.000e+00  2.078e-16 -2.406e+16   <2e-16 ***
x4  3.998e+00  2.690e-16  1.486e+16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.345e-15 on 26 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.937e+32 on 4 and 26 DF,  p-value: < 2.2e-16

Warning message:
In summary.lm(mod2) : essentially perfect fit: summary may be unreliable

多元多项式回归
继续,如果我们把给定的关系修改成含有二次项的公式呢?其中x4变成二次方,详细如下

> y=x1+3*x2-5*x3+3.998*x4*x4
> y=y+ep
> dd=data.frame(x1=x1,x2=x2,x3=x3,x4=x4,y=y)
> mod3=lm(y~x1+x2+x3+x4,data=dd)
> summary(mod3)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = dd)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8165 -2.7676 -0.5798  1.8480 11.2441 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.3588     0.7948   4.226 0.000277 ***
x1            3.2356     0.8805   3.675 0.001137 ** 
x2            3.9518     0.9268   4.264 0.000251 ***
x3           -3.7740     0.6226  -6.062 2.47e-06 ***
x4           -0.6444     0.8287  -0.778 0.444074    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.968 on 25 degrees of freedom
Multiple R-squared:  0.7164,    Adjusted R-squared:  0.671 
F-statistic: 15.79 on 4 and 25 DF,  p-value: 1.435e-06

这里,看到x4的系数是显著为零的。预示我们给定的回归方程是有问题的。减去x4看看,

> mod3=lm(y~x1+x2+x3,data=dd)
> summary(mod3)

Call:
lm(formula = y ~ x1 + x2 + x3, data = dd)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0488 -2.6193 -0.4313  2.2402 12.3472 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.5371     0.7552   4.684 7.74e-05 ***
x1            3.1604     0.8685   3.639 0.001190 ** 
x2            3.9322     0.9194   4.277 0.000226 ***
x3           -3.9055     0.5946  -6.569 5.77e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.938 on 26 degrees of freedom
Multiple R-squared:  0.7095,    Adjusted R-squared:  0.676 
F-statistic: 21.17 on 3 and 26 DF,  p-value: 3.744e-07

去掉x4以后,x1,x2,x3的系数都是显著的,但是R^2的值并小了,说明这个拟合效果比上面的那个更差。也就是x4不能拿掉。给出x4的平方项

> mod3=lm(y~x1+x2+x3+I(x4*x4),data=dd)
> summary(mod3)

Call:
lm(formula = y ~ x1 + x2 + x3 + I(x4 * x4), data = dd)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2077 -0.3817  0.1617  0.5920  1.9135 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.3319     0.2628   1.263    0.218    
x1            1.3040     0.2496   5.224 2.09e-05 ***
x2            3.6656     0.2428  15.096 4.54e-14 ***
x3           -4.8499     0.1647 -29.445  < 2e-16 ***
I(x4 * x4)    3.8088     0.2039  18.681 3.36e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.038 on 25 degrees of freedom
Multiple R-squared:  0.9806,    Adjusted R-squared:  0.9775 
F-statistic: 315.7 on 4 and 25 DF,  p-value: < 2.2e-16

拟合的R^2达到了0.98 , 各系数也基本上是显著的,可见上面的回归方程没问题。当然,在预先不知道回归方程的情况下,要给出上面的回归方程是有些困难的。但是这里至少给出一个概念,如果R^2值过小,说明回归方程是有问题的。



本文地址: http://www.bagualu.net/wordpress/archives/4967 转载请注明




发表评论

电子邮件地址不会被公开。 必填项已用*标注