General Statistics Examples

Example 10.3 Regression

This example is a module that calculates statistics that are associated with a linear regression. The following module is similar to the REGRESS module , which is included in the IMLMLIB library:

proc iml;
start regress( x, y, name, tval=, l1=, l2=, l3= );
   n = nrow(x);                          /* number of observations   */
   k = ncol(x);                          /* number of variables      */
   xpx = x` * x;                         /* cross-products           */
   xpy = x` * y;
   xpxi = inv(xpx);                      /* inverse crossproducts    */

   b = xpxi * xpy;                       /* parameter estimates      */
   yhat = x * b;                         /* predicted values         */
   resid = y - yhat;                     /* residuals                */
   sse = resid` * resid;                 /* sum of squared errors    */
   dfe = n - k;                          /* degrees of freedom error */
   mse = sse / dfe;                      /* mean squared error       */
   rmse = sqrt(mse);                     /* root mean squared error  */

   covb = xpxi # mse;                    /* covariance of estimates  */
   stdb = sqrt(vecdiag(covb));           /* standard errors          */
   t = b / stdb;                         /* ttest for estimates=0    */
   probt = 1 - cdf("F",t#t,1,dfe);       /* significance probability */
   paramest = b || stdb || t || probt;
   print paramest[c={"Estimate" "StdErr" "t" "Pr>|t|"} r=name
                  l="Parameter Estimates" f=Best6.];

   s = diag(1/stdb);
   corrb = s * covb * s;                 /* correlation of estimates */
   reset fw=6 spaces=3;                  /* for proper formatting    */
   print covb[r=name c=name l="Covariance of Estimates"],
         corrb[r=name c=name l="Correlation of Estimates"];

   if nrow(tval) = 0 then return;        /* is a t-value specified?  */
   projx = x * xpxi * x`;                /* hat matrix               */
   vresid = (i(n) - projx) * mse;        /* covariance of residuals  */
   vpred = projx # mse;                  /* covariance of pred vals  */
   h = vecdiag(projx);                   /* hat leverage values      */
   lowerm = yhat - tval # sqrt(h*mse);   /* lower conf limit for mean*/
   upperm = yhat + tval # sqrt(h*mse);   /* upper CL for mean        */
   lower = yhat - tval # sqrt(h*mse+mse);/* lower CL for individual  */
   upper = yhat + tval # sqrt(h*mse+mse);/* upper CL                 */

   R = y || yhat || resid || h || lowerm || upperm || lower || upper;
   labels = {"y" "yhat" "resid" "h" "lowerm" "upperm" "lower" "upper"};
   reset fw=6 spaces=1;
   print R[c=labels label="Predicted values, Residuals, and Limits"];

   /* test hypoth that L*b = 0, where L is linear comb of estimates  */
   do i = 1 to 3;
      L = value("L"+ strip(char(i)));   /* get L matrix for L1, L2, and L3 */
      if nrow(L) = 0 then return;
      dfn = nrow(L);
      Lb = L * b;
      vLb = L * xpxi * L`;
      q = Lb` * inv(vLb) * Lb / dfn;
      f = q / mse;
      prob = 1 - cdf("F", f,dfn,dfe);
      test = dfn || dfe || f || prob;
      print L, test[c={"dfn" "dfe" "F" "Pr>F"} f=Best6.
                    l="Test Hypothesis that L*b = 0"];
   end;
finish;

The module accepts up to three matrices for testing the hypothesis that a linear combination of the parameters is zero. For more information about the computation, see the documentation for the TEST statement in the REG procedure in SAS/STAT User's Guide.

The following statements call the REGRESS module on data that describes the size of the US population during eight decades 1790–1860. The following statements fit a quadratic regression model to the data. The program also tests three hypotheses about the parameters in the model.

/* Quadratic regression on US population for decades beginning 1790 */
decade = T(1:8);
name={"Intercept", "Decade", "Decade**2" };
x= decade##0 || decade || decade##2;
y= {3.929,5.308,7.239,9.638,12.866,17.069,23.191,31.443};

/* n-p=5 dof at 0.025 level to get 95% confidence interval */
tval = quantile("T", 1-0.025, nrow(x)-ncol(x));
L1 = { 0 1 0 };   /* test hypothesis Lb=0 for linear coef */
L2 = { 0 1 0,     /* test hypothesis Lb=0 for linear,quad */
       0 0 1 };
L3 = { 0 1 1 };   /* test hypothesis Lb=0 for linear+quad */
option linesize=100;
run regress( x, y, name, tval, L1, L2, L3 );

The parameters estimates are shown in the first table in Output 10.3.1. The next two tables show the covariance and correlation of the estimates, respectively.

Output 10.3.1: Regression Results

Parameter Estimates
	Estimate	StdErr	t	Pr>\|t\|
Intercept	5.0693	0.9656	5.25	0.0033
Decade	-1.11	0.4923	-2.255	0.0739
Decade**2	0.5396	0.0534	10.106	0.0002

Covariance of Estimates
	Intercept	Decade	Decade**2
Intercept	0.9324	-0.436	0.0428
Decade	-0.436	0.2424	-0.026
Decade**2	0.0428	-0.026	0.0029

Correlation of Estimates
	Intercept	Decade	Decade**2
Intercept	1	-0.918	0.8295
Decade	-0.918	1	-0.976
Decade**2	0.8295	-0.976	1

The predicted values, residuals, leverage, and confidence limits for the mean and individual predictions are shown in Output 10.3.2.

Output 10.3.2: Regression Results: Predicted Values and Residuals

Predicted values, Residuals, and Limits
y	yhat	resid	h	lowerm	upperm	lower	upper
3.929	4.499	-0.57	0.7083	3.0017	5.9964	2.1737	6.8244
5.308	5.008	0.3	0.2798	4.067	5.949	2.9954	7.0207
7.239	6.5963	0.6427	0.2321	5.7391	7.4535	4.6214	8.5711
9.638	9.2638	0.3742	0.2798	8.3228	10.205	7.2511	11.276
12.866	13.011	-0.145	0.2798	12.07	13.952	10.998	15.023
17.069	17.837	-0.768	0.2321	16.979	18.694	15.862	19.812
23.191	23.742	-0.551	0.2798	22.801	24.683	21.729	25.755
31.443	30.727	0.7164	0.7083	29.229	32.224	28.401	33.052

The results of the hypothesis tests are shown in Output 10.3.3. The first hypothesis is that the coefficient of the linear term is 0. This hypothesis is not rejected at the 0.05 significance level. The second hypothesis is that the coefficients of the linear and quadratic terms are simultaneously 0. This hypothesis is soundly rejected. The third hypothesis is that the linear coefficient is equal to the negative of the quadratic coefficient. Given the data, this hypothesis is not rejected.

Output 10.3.3: Regression Results: Hypothesis Tests

L
0	1	0

Test Hypothesis that L*b = 0
dfn	dfe	F	Pr>F
1	5	5.0832	0.0739

L
0	1	0
0	0	1

Test Hypothesis that L*b = 0
dfn	dfe	F	Pr>F
2	5	666.51	854E-9

L
0	1	1

Test Hypothesis that L*b = 0
dfn	dfe	F	Pr>F
1	5	1.6775	0.2518