Previous Page | Next Page

A Module for Linear Regression

The linear systems that arise naturally in statistics are usually overconstrained, meaning that the $\text{[math]}$ matrix has more rows than columns and that an exact solution to the linear system is impossible to find. Instead, the statistician assumes a linear model of the form

$\text{[math]}$

where $\text{[math]}$ is the vector of responses, $\text{[math]}$ is a design matrix, and $\text{[math]}$ is a vector of unknown parameters that are estimated by minimizing the sum of squares of $\text{[math]}$ , the error or residual term.

The following example illustrates some programming techniques by using SAS/IML statements to perform linear regression. (The example module does not replace regression procedures such as the REG procedure, which are more efficient for regressions and offer a multitude of diagnostic options.)

Suppose you have response data $\text{[math]}$ measured at five values of the independent variable $\text{[math]}$ and you want to perform a quadratic regression. In this case, you can define the design matrix $\text{[math]}$ and the data vector $\text{[math]}$ as follows:

proc iml;
x = {1 1 1,
     1 2 4,
     1 3 9,
     1 4 16,
     1 5 25};
y = {1, 5, 9, 23, 36};

You can compute the least squares estimate of $\text{[math]}$ by using the following statement:

b = inv(x`*x) * x`*y;
print b;

Figure 4.2 Parameter Estimates

b
2.4
-3.2
2

The predicted values are found by multiplying the data matrix and the parameter estimates; the residuals are the differences between actual and predicted responses, as shown in the following statements:

yhat = x*b;
r = y-yhat;
print yhat r;

Figure 4.3 Predicted and Residual Values

yhat	r
1.2	-0.2
4	1
10.8	-1.8
21.6	1.4
36.4	-0.4

To estimate the variance of the responses, calculate the sum of squared errors (SSE), the error degrees of freedom (DFE), and the mean squared error (MSE) as follows:

sse = ssq(r);
dfe = nrow(x)-ncol(x);
mse = sse/dfe;
print sse dfe mse;

Figure 4.4 Statistics for a Linear Model

sse	dfe	mse
6.4	2	3.2

Notice that in computing the degrees of freedom, you use the function NCOL to return the number of columns of $\text{[math]}$ and the function NROW to return the number of rows.

Now suppose you want to solve the problem repeatedly on new data. To do this, you can define a module. Modules begin with a START statement and end with a FINISH statement, with the program statements in between. The following statements define a module named Regress to perform linear regression:

start Regress;                    /* begins module        */
  xpxi = inv(x`*x);               /* inverse of X'X      */
  beta = xpxi * (x`*y);           /* parameter estimate  */
  yhat = x*beta;                  /* predicted values    */
  resid = y-yhat;                 /* residuals           */

  sse = ssq(resid);               /* SSE                 */
  n = nrow(x);                    /* sample size         */
  dfe = nrow(x)-ncol(x);          /* error DF            */
  mse = sse/dfe;                  /* MSE                 */
  cssy = ssq(y-sum(y)/n);         /* corrected total SS  */
  rsquare = (cssy-sse)/cssy;      /* RSQUARE             */
  print ,"Regression Results", sse dfe mse rsquare;

  stdb = sqrt(vecdiag(xpxi)*mse); /* std of estimates    */
  t = beta/stdb;                  /* parameter t tests   */
  prob = 1-probf(t#t,1,dfe);      /* p-values            */
  print ,"Parameter Estimates",, beta stdb t prob;
  print ,y yhat resid;
finish Regress;                   /* ends module          */

Assuming that the matrices x and y are defined, you can run the Regress module as follows:

run Regress;                     /* executes module      */

Figure 4.5 The Results of a Regression Module

Regression Results

sse	dfe	mse	rsquare
6.4	2	3.2	0.9923518

Parameter Estimates

beta	stdb	t	prob
2.4	3.8366652	0.6255432	0.5954801
-3.2	2.923794	-1.094468	0.387969
2	0.4780914	4.1833001	0.0526691

y	yhat	resid
1	1.2	-0.2
5	4	1
9	10.8	-1.8
23	21.6	1.4
36	36.4	-0.4