Tutorial: A Module for Linear Regression

The previous method might be more familiar to statisticians if different notation is used. A linear model is usually written as

y = Xb + e

where y is the vector of responses, X is the design matrix, and b is a vector of unknown parameters that are estimated by minimizing the sum of squares of e, the error or residual. Minimizing this sum of squares leads to the normal equations (X'X)b = X'y, whose solution is the traditional least-squares formula used below.

The following example illustrates the programming techniques involved in performing linear regression. It is not meant to replace regression procedures such as the REG procedure, which are more efficient for regressions and offer a multitude of diagnostic options.

Suppose you have response data y measured at five values of the independent variable x and you want to perform a quadratic regression.

Submit the PROC IML statement to begin the procedure:

  
    > proc iml; 
  
       IML Ready
 

Input the design matrix x and the data vector y as matrix literals:

  
    > x={1 1 1, 
    >    1 2 4, 
    >    1 3 9, 
    >    1 4 16, 
    >    1 5 25}; 
  
  
               X           5 rows      3 cols    (numeric) 
  
                             1         1         1 
                             1         2         4 
                             1         3         9 
                             1         4        16 
                             1         5        25 
  
  
    > y={1,5,9,23,36}; 
  
               Y           5 rows      1 col     (numeric) 
  
                                       1 
                                       5 
                                       9 
                                      23 
                                      36
 

Compute the least-squares estimate of b by using the traditional formula:

  
    > b=inv(x`*x)*x`*y; 
  
               B             3 rows      1 col     (numeric) 
  
                                         2.4 
                                        -3.2 
                                           2
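
As an aside, forming an explicit inverse is convenient for exposition but is not the most numerically stable approach. The SOLVE function computes the same estimates by solving the normal equations directly; the following one-line sketch (b2 is just an illustrative name) is equivalent:

      b2=solve(x`*x,x`*y);    /* solve (X'X)b = X'y without forming an inverse */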
 

The predicted values are simply the x matrix multiplied by the parameter estimates, and the residuals are the differences between the actual and predicted values of y, as follows:

  
    > yhat=x*b; 
  
               YHAT          5 rows      1 col     (numeric) 
  
                                         1.2 
                                           4 
                                        10.8 
                                        21.6 
                                        36.4
 

  
    > r=y-yhat; 
  
               R           5 rows      1 col     (numeric) 
  
                                       -0.2 
                                          1 
                                       -1.8 
                                        1.4 
                                       -0.4
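
As a quick sanity check, least-squares residuals are orthogonal to every column of the design matrix, so the following product should contain values that are numerically zero. This check is a sketch and not part of the original session (check is an illustrative name):

      check=x`*r;    /* X'r: each entry should be zero up to rounding */
      print check;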
 
To estimate the variance of the responses, calculate the sum of squared errors (SSE), its degrees of freedom (DFE), and the mean squared error (MSE) as follows. Note that in computing the degrees of freedom, you use the NROW function to return the number of rows of x and the NCOL function to return the number of columns.
  
    > sse=ssq(r); 
  
               SSE           1 row       1 col     (numeric) 
  
                                         6.4 
  
    > dfe=nrow(x)-ncol(x); 
  
               DFE           1 row       1 col     (numeric) 
  
                                         2 
  
  
    > mse=sse/dfe; 
  
               MSE           1 row       1 col     (numeric) 
  
                                        3.2
 
Notice that each calculation has required one simple line of code.
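
As a brief aside, SSQ applied to a column vector is equivalent to the inner product of the vector with itself, and the root mean squared error takes one more line. This sketch (sse2 and rmse are illustrative names) is not part of the original session:

      sse2=r`*r;         /* same value as ssq(r) for a column vector */
      rmse=sqrt(mse);    /* root mean squared error                  */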

Now suppose you want to solve the problem repeatedly on new data sets without reentering the code. To do this, define a module (or subroutine). Modules begin with a START statement and end with a FINISH statement, with the program statements in between. The following statements define a module named REGRESS to perform linear regression:

  
    > start regress;                   /* begin module        */ 
    >    xpxi=inv(t(x)*x);             /* inverse of X'X      */ 
    >    beta=xpxi*(t(x)*y);           /* parameter estimate  */ 
    >    yhat=x*beta;                  /* predicted values    */ 
    >    resid=y-yhat;                 /* residuals           */ 
    >    sse=ssq(resid);               /* SSE                 */ 
    >    n=nrow(x);                    /* sample size         */ 
    >    dfe=nrow(x)-ncol(x);          /* error DF            */ 
    >    mse=sse/dfe;                  /* MSE                 */ 
    >    cssy=ssq(y-sum(y)/n);         /* corrected total SS  */ 
    >    rsquare=(cssy-sse)/cssy;      /* RSQUARE             */ 
    >    print,"Regression Results", 
    >       sse dfe mse rsquare; 
    >    stdb=sqrt(vecdiag(xpxi)*mse); /* std of estimates    */ 
    >    t=beta/stdb;                  /* parameter t tests   */ 
    >    prob=1-probf(t#t,1,dfe);      /* p-values            */ 
    >    print,"Parameter Estimates",, 
    >       beta stdb t prob; 
    >    print,y yhat resid; 
    > finish regress;                  /* end module          */
 

Submit the module REGRESS for execution as follows:

  
    > reset noprint; 
    > run regress;                     /* execute module      */ 
  
  
                               Regression Results 
  
  
                        SSE       DFE       MSE    RSQUARE 
                        6.4         2       3.2  0.9923518 
  
                              Parameter Estimates 
  
  
                       BETA       STDB          T       PROB 
                        2.4  3.8366652  0.6255432  0.5954801 
                       -3.2  2.9237940  -1.094468  0.3879690 
                          2  0.4780914  4.1833001  0.0526691 
  
                                Y      YHAT     RESID 
                                1       1.2      -0.2 
                                5         4         1 
                                9      10.8      -1.8 
                               23      21.6       1.4 
                               36      36.4      -0.4
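
The REGRESS module operates on the global matrices x and y. Modules can also be defined with an explicit argument list, which makes the data dependencies visible at the call site. A minimal sketch of such a variant follows; REGRESS2 and b2 are hypothetical names, and IML module arguments are passed by reference, so beta returns the estimates:

      start regress2(beta,x,y);     /* beta is returned through the argument list */
         beta=solve(x`*x,x`*y);     /* least-squares solution of (X'X)b = X'y     */
      finish regress2;

      run regress2(b2,x,y);         /* b2 receives the estimates */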
 

At this point, you still have all of the matrices defined if you want to continue calculations. Suppose you want to correlate the estimates. First, calculate the estimated covariance matrix of the parameter estimates; then scale the covariance matrix into a correlation matrix, which has values of 1 on the diagonal. You can perform these operations by using the following statements:

  
    > reset print;             /* turn on auto printing   */ 
    > covb=xpxi*mse;           /* covariance of estimates */ 
  
               COVB          3 rows      3 cols    (numeric) 
  
                           14.72     -10.56        1.6 
                          -10.56  8.5485714  -1.371429 
                             1.6  -1.371429  0.2285714 
  
    > s=1/sqrt(vecdiag(covb)); 
  
               S             3 rows      1 col     (numeric) 
  
                                  0.260643 
                                 0.3420214 
                                  2.0916501 
  
    > corrb=diag(s)*covb*diag(s);   /* correlation of estimates */ 
  
               CORRB         3 rows      3 cols    (numeric) 
  
                                1  -0.941376  0.8722784 
                        -0.941376          1  -0.981105 
                        0.8722784  -0.981105          1
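
Because premultiplying and postmultiplying by DIAG(S) simply scales entry (i,j) by the i-th and j-th elements of s, the same correlation matrix can be formed elementwise with the # operator and an outer product. This one-line alternative is a sketch (corrb2 is an illustrative name), not part of the original session:

      corrb2=covb#(s*s`);    /* elementwise: covb[i,j]*s[i]*s[j] */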
 

Your module REGRESS remains available to do another regression, in this case an orthogonalized version of the last polynomial example. In general, the columns of x are not orthogonal. You can use the ORPOL function to generate orthogonal polynomials for the regression. Using them provides greater computing accuracy and reduced computing time. When you use orthogonal polynomial regression, you can expect the statistics of fit to be the same and the estimates to be more stable and uncorrelated.

To perform an orthogonal regression on the data, you must first create a vector that contains the values of the independent variable, which form the second column of the design matrix x. Then, use the ORPOL function to generate second-degree orthogonal polynomials. You can perform these operations by using the following statements:

  
    > x1={1,2,3,4,5};               /* second column of X */ 
  
               X1            5 rows      1 col     (numeric) 
  
                                         1 
                                         2 
                                         3 
                                         4 
                                         5 
  
  
    > x=orpol(x1,2);    /* generates orthogonal polynomials */ 
  
               X             5 rows      3 cols    (numeric) 
  
                        0.4472136  -0.632456  0.5345225 
                        0.4472136  -0.316228  -0.267261 
                        0.4472136          0  -0.534522 
                        0.4472136  0.3162278  -0.267261 
                        0.4472136  0.6324555  0.5345225 
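
The columns that ORPOL generates are orthonormal, so the crossproducts matrix X'X is the identity matrix. You can confirm this with a quick check like the following sketch (xpx is an illustrative name; the result is exact only up to the floating-point fuzz discussed later):

      xpx=x`*x;    /* should print as the 3x3 identity matrix, up to rounding */
      print xpx;

Now rerun the REGRESS module on the orthogonalized design: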
  
    > reset noprint;           /* turns off auto printing   */ 
    > run regress;             /* run REGRESS */ 
  
  
                              Regression Results 
  
  
                        SSE       DFE       MSE   RSQUARE 
                        6.4         2       3.2 0.9923518 
  
                              Parameter Estimates 
  
  
                       BETA      STDB         T      PROB 
                  33.093806 1.7888544      18.5 0.0029091 
                  27.828043 1.7888544 15.556349 0.0041068 
                  7.4833148 1.7888544 4.1833001 0.0526691 
  
                                Y      YHAT     RESID 
                                1       1.2      -0.2 
                                5         4         1 
                                9      10.8      -1.8 
                               23      21.6       1.4 
                               36      36.4      -0.4 
  
    > reset print; 
    > covb=xpxi*mse; 
  
                COVB          3 rows      3 cols    (numeric) 
  
                               3.2 -2.73E-17 4.693E-16 
                         -2.73E-17       3.2 -2.18E-15 
                         4.693E-16 -2.18E-15       3.2 
  
  
    > s=1/sqrt(vecdiag(covb)); 
  
                S             3 rows      1 col     (numeric) 
  
                                    0.559017 
                                    0.559017 
                                    0.559017
 

  
    > corrb=diag(s)*covb*diag(s); 
  
                CORRB         3 rows      3 cols    (numeric) 
  
                                 1  -8.54E-18  1.467E-16 
                         -8.54E-18          1   -6.8E-16 
                         1.467E-16   -6.8E-16          1
 

Note that the off-diagonal values are displayed in scientific notation; they are close to zero but not exactly zero because of the imprecision of floating-point arithmetic. To clean up the appearance of the correlation matrix, use the FUZZ option as follows:

  
    > reset fuzz; 
    > corrb=diag(s)*covb*diag(s); 
  
                CORRB         3 rows      3 cols    (numeric) 
  
                                 1         0         0 
                                 0         1         0 
                                 0         0         1
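
Keep in mind that FUZZ affects only how values are printed; the stored elements of CORRB still contain the tiny nonzero values. The paired NOFUZZ option restores the default display:

      reset nofuzz;    /* restore default printing of small values */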
 
