A Module for Linear Regression

The linear systems that arise naturally in statistics are usually overconstrained, meaning that the $\mb {X}$ matrix has more rows than columns and that an exact solution to the linear system generally does not exist. Instead, the statistician assumes a linear model of the form

\[  \mb {y} = \mb {Xb} + \mb {e}  \]

where $\mb {y}$ is the vector of responses, $\mb {X}$ is a design matrix, and $\mb {b}$ is a vector of unknown parameters that are estimated by minimizing the sum of squares of $\mb {e}$, the error or residual term.
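
The least squares estimate is the value of $\mb {b}$ that minimizes the sum of squared residuals $\mb {e}'\mb {e}$. Assuming that $\mb {X}'\mb {X}$ is nonsingular, the estimate has the closed form

\[  \hat{\mb {b}} = (\mb {X}'\mb {X})^{-1} \mb {X}'\mb {y}  \]

and the statements in this section compute exactly this quantity.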

The following example illustrates some programming techniques by using SAS/IML statements to perform linear regression. (The example module does not replace regression procedures such as the REG procedure, which are more efficient for regressions and offer a multitude of diagnostic options.)

Suppose you have response data $\mb {y}$ measured at five values of an independent variable and you want to perform a quadratic regression. In this case, you can define the design matrix $\mb {X}$, whose columns contain the intercept, linear, and quadratic terms, and the data vector $\mb {y}$ as follows:

proc iml;
x = {1 1 1,                /* design matrix: intercept, x, x squared */
     1 2 4,
     1 3 9,
     1 4 16,
     1 5 25};
y = {1, 5, 9, 23, 36};     /* response vector */
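
Equivalently, you can construct the same design matrix from a vector of values of the independent variable by horizontal concatenation. A minimal sketch (the variable name t is arbitrary):

t = {1, 2, 3, 4, 5};                  /* values of the independent variable */
x = j(nrow(t), 1, 1) || t || t##2;    /* columns: intercept, x, x squared   */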

You can compute the least squares estimate of $\mb {b}$ by using the following statement:

b = inv(x`*x) * x`*y;
print b;

Figure 4.2: Parameter Estimates

b
2.4
-3.2
2

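The INV function forms an explicit inverse, which is convenient for a small example. An equivalent computation, and in general a more numerically stable one, solves the normal equations directly with the SOLVE function. A minimal sketch:

b2 = solve(x`*x, x`*y);    /* solve (X`*X)*b = X`*y without forming an inverse */
print b2;

The values of b2 should match the estimates shown in Figure 4.2.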

The predicted values are found by multiplying the data matrix and the parameter estimates; the residuals are the differences between actual and predicted responses, as shown in the following statements:

yhat = x*b;                /* predicted values */
r = y-yhat;                /* residuals        */
print yhat r;

Figure 4.3: Predicted and Residual Values

yhat      r
 1.2   -0.2
   4      1
10.8   -1.8
21.6    1.4
36.4   -0.4

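As a quick check on the fit, the least squares residuals are orthogonal to every column of the design matrix, so the product $\mb {X}'\mb {r}$ is zero apart from rounding error. A minimal sketch:

ortho = x`*r;              /* each element should be approximately zero */
print ortho;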

To estimate the error variance, calculate the sum of squared errors (SSE), the error degrees of freedom (DFE), and the mean squared error (MSE), as follows:

sse = ssq(r);              /* sum of squared errors     */
dfe = nrow(x)-ncol(x);     /* error degrees of freedom  */
mse = sse/dfe;             /* mean squared error        */
print sse dfe mse;

Figure 4.4: Statistics for a Linear Model

sse   dfe   mse
6.4     2   3.2


Notice that in computing the degrees of freedom, you use the NROW function to return the number of rows of $\mb {X}$ and the NCOL function to return the number of columns.
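
With $n$ denoting the number of rows of $\mb {X}$ and $p$ the number of columns, these quantities are related by

\[  \mathrm{DFE} = n - p, \qquad \mathrm{MSE} = \frac{\mathrm{SSE}}{\mathrm{DFE}}  \]

so for this example $\mathrm{DFE} = 5 - 3 = 2$, as shown in Figure 4.4.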

Now suppose you want to solve the problem repeatedly on new data. To do this, you can define a module. Modules begin with a START statement and end with a FINISH statement, with the program statements in between. The following statements define a module named Regress to perform linear regression:

start Regress;                    /* begins module        */
  xpxi = inv(x`*x);               /* inverse of X'X      */
  beta = xpxi * (x`*y);           /* parameter estimate  */
  yhat = x*beta;                  /* predicted values    */
  resid = y-yhat;                 /* residuals           */

  sse = ssq(resid);               /* SSE                 */
  n = nrow(x);                    /* sample size         */
  dfe = nrow(x)-ncol(x);          /* error DF            */
  mse = sse/dfe;                  /* MSE                 */
  cssy = ssq(y-sum(y)/n);         /* corrected total SS  */
  rsquare = (cssy-sse)/cssy;      /* RSQUARE             */
  print ,"Regression Results", sse dfe mse rsquare;

  stdb = sqrt(vecdiag(xpxi)*mse); /* std of estimates    */
  t = beta/stdb;                  /* parameter t tests   */
  prob = 1-probf(t#t,1,dfe);      /* p-values            */
  print ,"Parameter Estimates",, beta stdb t prob;
  print ,y yhat resid;
finish Regress;                   /* ends module          */
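
In addition to the quantities computed earlier, the module prints the R-square statistic, the standard errors of the estimates, the parameter t statistics, and their p-values. In terms of the model, these are

\[  R^2 = 1 - \frac{\mathrm{SSE}}{\sum_{i}(y_i - \bar{y})^2}, \qquad \mathrm{stdb}_j = \sqrt{\mathrm{MSE}\,\bigl[(\mb {X}'\mb {X})^{-1}\bigr]_{jj}}, \qquad t_j = \frac{\hat{b}_j}{\mathrm{stdb}_j}  \]

and each p-value is obtained by comparing $t_j^2$ with an $F(1, \mathrm{DFE})$ distribution, which is what the PROBF call computes.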

Assuming that the matrices x and y are defined, you can run the Regress module as follows:

run Regress;                     /* executes module      */

Figure 4.5: The Results of a Regression Module

Regression Results

sse   dfe   mse     rsquare
6.4     2   3.2   0.9923518

Parameter Estimates

beta        stdb           t        prob
 2.4   3.8366652   0.6255432   0.5954801
-3.2    2.923794   -1.094468    0.387969
   2   0.4780914   4.1833001   0.0526691

 y   yhat   resid
 1    1.2    -0.2
 5      4       1
 9   10.8    -1.8
23   21.6     1.4
36   36.4    -0.4
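
Because the Regress module is defined without arguments, it operates on the matrices x and y in the global symbol table. To analyze new data, redefine those matrices and run the module again. A minimal sketch with hypothetical new responses measured at the same design points:

y = {2, 4, 10, 21, 38};    /* hypothetical new response values */
run Regress;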