General Statistics Examples


Example 10.3 Regression

This example is a module that calculates statistics that are associated with a linear regression. The following module is similar to the REGRESS module , which is included in the IMLMLIB library:

proc iml;
start regress( x, y, name, tval=, l1=, l2=, l3= );
   n = nrow(x);                          /* number of observations   */
   k = ncol(x);                          /* number of variables      */
   xpx = x` * x;                         /* cross-products           */
   xpy = x` * y;
   xpxi = inv(xpx);                      /* inverse crossproducts    */

   b = xpxi * xpy;                       /* parameter estimates      */
   yhat = x * b;                         /* predicted values         */
   resid = y - yhat;                     /* residuals                */
   sse = resid` * resid;                 /* sum of squared errors    */
   dfe = n - k;                          /* degrees of freedom error */
   mse = sse / dfe;                      /* mean squared error       */
   rmse = sqrt(mse);                     /* root mean squared error  */

   covb = xpxi # mse;                    /* covariance of estimates  */
   stdb = sqrt(vecdiag(covb));           /* standard errors          */
   t = b / stdb;                         /* ttest for estimates=0    */
   probt = 1 - cdf("F",t#t,1,dfe);       /* significance probability */
   paramest = b || stdb || t || probt;
   print paramest[c={"Estimate" "StdErr" "t" "Pr>|t|"} r=name
                  l="Parameter Estimates" f=Best6.];

   s = diag(1/stdb);
   corrb = s * covb * s;                 /* correlation of estimates */
   reset fw=6 spaces=3;                  /* for proper formatting    */
   print covb[r=name c=name l="Covariance of Estimates"],
         corrb[r=name c=name l="Correlation of Estimates"];

   if nrow(tval) = 0 then return;        /* is a t-value specified?  */
   projx = x * xpxi * x`;                /* hat matrix               */
   vresid = (i(n) - projx) * mse;        /* covariance of residuals  */
   vpred = projx # mse;                  /* covariance of pred vals  */
   h = vecdiag(projx);                   /* hat leverage values      */
   lowerm = yhat - tval # sqrt(h*mse);   /* lower conf limit for mean*/
   upperm = yhat + tval # sqrt(h*mse);   /* upper CL for mean        */
   lower = yhat - tval # sqrt(h*mse+mse);/* lower CL for individual  */
   upper = yhat + tval # sqrt(h*mse+mse);/* upper CL                 */

   R = y || yhat || resid || h || lowerm || upperm || lower || upper;
   labels = {"y" "yhat" "resid" "h" "lowerm" "upperm" "lower" "upper"};
   reset fw=6 spaces=1;
   print R[c=labels label="Predicted values, Residuals, and Limits"];

   /* test hypoth that L*b = 0, where L is linear comb of estimates  */
   do i = 1 to 3;
      L = value("L"+ strip(char(i)));   /* get L matrix for L1, L2, and L3 */
      if nrow(L) = 0 then return;
      dfn = nrow(L);
      Lb = L * b;
      vLb = L * xpxi * L`;
      q = Lb` * inv(vLb) * Lb / dfn;
      f = q / mse;
      prob = 1 - cdf("F", f,dfn,dfe);
      test = dfn || dfe || f || prob;
      print L, test[c={"dfn" "dfe" "F" "Pr>F"} f=Best6.
                    l="Test Hypothesis that L*b = 0"];
   end;
finish;

The module accepts up to three matrices for testing the hypothesis that a linear combination of the parameters is zero. For more information about the computation, see the documentation for the TEST statement in the REG procedure in SAS/STAT User's Guide.

The following statements call the REGRESS module on data that describes the size of the US population during eight decades 1790–1860. The following statements fit a quadratic regression model to the data. The program also tests three hypotheses about the parameters in the model.

/* Quadratic regression on US population for decades beginning 1790 */
decade = T(1:8);
name={"Intercept", "Decade", "Decade**2" };
x= decade##0 || decade || decade##2;
y= {3.929,5.308,7.239,9.638,12.866,17.069,23.191,31.443};

/* n-p=5 dof at 0.025 level to get 95% confidence interval */
tval = quantile("T", 1-0.025, nrow(x)-ncol(x));
L1 = { 0 1 0 };   /* test hypothesis Lb=0 for linear coef */
L2 = { 0 1 0,     /* test hypothesis Lb=0 for linear,quad */
       0 0 1 };
L3 = { 0 1 1 };   /* test hypothesis Lb=0 for linear+quad */
option linesize=100;
run regress( x, y, name, tval, L1, L2, L3 );

The parameters estimates are shown in the first table in Output 10.3.1. The next two tables show the covariance and correlation of the estimates, respectively.

Output 10.3.1: Regression Results

Parameter Estimates
  Estimate StdErr t Pr>|t|
Intercept 5.0693 0.9656 5.25 0.0033
Decade -1.11 0.4923 -2.255 0.0739
Decade**2 0.5396 0.0534 10.106 0.0002

Covariance of Estimates
  Intercept Decade Decade**2
Intercept 0.9324 -0.436 0.0428
Decade -0.436 0.2424 -0.026
Decade**2 0.0428 -0.026 0.0029

Correlation of Estimates
  Intercept Decade Decade**2
Intercept 1 -0.918 0.8295
Decade -0.918 1 -0.976
Decade**2 0.8295 -0.976 1



The predicted values, residuals, leverage, and confidence limits for the mean and individual predictions are shown in Output 10.3.2.

Output 10.3.2: Regression Results: Predicted Values and Residuals

Predicted values, Residuals, and Limits
y yhat resid h lowerm upperm lower upper
3.929 4.499 -0.57 0.7083 3.0017 5.9964 2.1737 6.8244
5.308 5.008 0.3 0.2798 4.067 5.949 2.9954 7.0207
7.239 6.5963 0.6427 0.2321 5.7391 7.4535 4.6214 8.5711
9.638 9.2638 0.3742 0.2798 8.3228 10.205 7.2511 11.276
12.866 13.011 -0.145 0.2798 12.07 13.952 10.998 15.023
17.069 17.837 -0.768 0.2321 16.979 18.694 15.862 19.812
23.191 23.742 -0.551 0.2798 22.801 24.683 21.729 25.755
31.443 30.727 0.7164 0.7083 29.229 32.224 28.401 33.052



The results of the hypothesis tests are shown in Output 10.3.3. The first hypothesis is that the coefficient of the linear term is 0. This hypothesis is not rejected at the 0.05 significance level. The second hypothesis is that the coefficients of the linear and quadratic terms are simultaneously 0. This hypothesis is soundly rejected. The third hypothesis is that the linear coefficient is equal to the negative of the quadratic coefficient. Given the data, this hypothesis is not rejected.

Output 10.3.3: Regression Results: Hypothesis Tests

L
0 1 0

Test Hypothesis that L*b = 0
dfn dfe F Pr>F
1 5 5.0832 0.0739

L
0 1 0
0 0 1

Test Hypothesis that L*b = 0
dfn dfe F Pr>F
2 5 666.51 854E-9

L
0 1 1

Test Hypothesis that L*b = 0
dfn dfe F Pr>F
1 5 1.6775 0.2518