Tutorial: A Module for Linear Regression
The previous method might be more familiar to statisticians when different notation is used. A linear model is usually written as

   y = Xb + e

where y is the vector of responses, X is the design matrix, b is a vector of unknown parameters, and e is the vector of errors; the parameters are estimated by minimizing the sum of squares of e.
The following example illustrates the programming techniques involved in performing linear regression. It is not meant to replace dedicated regression procedures such as the REG procedure, which are more efficient and offer a multitude of diagnostic options.
Suppose you have response data y measured at five values of the independent variable x and you want to perform a quadratic regression.
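For comparison, the same quadratic fit could be obtained outside of IML with the REG procedure mentioned above. The following sketch uses the x values 1 through 5 and the responses shown later in this section; the data set name QUAD and the variable name X2 are only illustrative:

   data quad;                          /* hypothetical data set name   */
      input x y @@;
      x2 = x*x;                        /* quadratic term               */
      datalines;
   1 1  2 5  3 9  4 23  5 36
   ;

   proc reg data=quad;
      model y = x x2;                  /* fit y = b0 + b1*x + b2*x*x   */
   run;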
Submit the PROC IML statement to begin the procedure:
> proc iml;
IML Ready
Input the design matrix and the data vector as matrix literals:
> x={1 1 1,
>    1 2 4,
>    1 3 9,
>    1 4 16,
>    1 5 25};

          X             5 rows      3 cols    (numeric)

                    1         1         1
                    1         2         4
                    1         3         9
                    1         4        16
                    1         5        25

> y={1,5,9,23,36};

          Y             5 rows      1 col     (numeric)

                    1
                    5
                    9
                   23
                   36
Compute the least-squares estimate of b by using the traditional formula b = (X'X)^(-1)X'y:
> b=inv(x`*x)*x`*y;

          B             3 rows      1 col     (numeric)

                  2.4
                 -3.2
                    2
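The INV function is fine for a small, well-conditioned problem such as this one. For larger or nearly collinear design matrices, it is usually preferable to solve the normal equations directly rather than form an explicit inverse. A sketch of the equivalent computation with the SOLVE function follows (the name B2 is used only to avoid overwriting B):

> b2=solve(x`*x,x`*y);                 /* solves (X'X)*b2 = X'y        */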
The predicted values are simply the X matrix multiplied by the parameter estimates, and the residuals are the differences between the actual and predicted values of y, as follows:
> yhat=x*b;

          YHAT          5 rows      1 col     (numeric)

                  1.2
                    4
                 10.8
                 21.6
                 36.4
> r=y-yhat;

          R             5 rows      1 col     (numeric)

                 -0.2
                    1
                 -1.8
                  1.4
                 -0.4

To calculate the estimate of the variance of the responses, calculate the sum of squared errors (SSE), its degrees of freedom (DFE), and the mean squared error (MSE) as follows. Note that in computing the degrees of freedom, you use the NCOL function to return the number of columns of X.
> sse=ssq(r);

          SSE           1 row       1 col     (numeric)

                  6.4

> dfe=nrow(x)-ncol(x);

          DFE           1 row       1 col     (numeric)

                    2

> mse=sse/dfe;

          MSE           1 row       1 col     (numeric)

                  3.2

Notice that each calculation has required one simple line of code.
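As a quick check of these values, the residuals give SSE = (-0.2)^2 + 1^2 + (-1.8)^2 + 1.4^2 + (-0.4)^2 = 6.4, there are DFE = 5 - 3 = 2 degrees of freedom (five observations and three columns of X), and MSE = 6.4/2 = 3.2.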
Now suppose you want to solve the problem repeatedly on new data sets without reentering the code. To do this, define a module (or subroutine). Modules begin with a START statement and end with a FINISH statement, with the program statements in between. The following statements define a module named REGRESS to perform linear regression:
> start regress;                       /* begin module             */
>    xpxi=inv(t(x)*x);                 /* inverse of X'X           */
>    beta=xpxi*(t(x)*y);               /* parameter estimate       */
>    yhat=x*beta;                      /* predicted values         */
>    resid=y-yhat;                     /* residuals                */
>    sse=ssq(resid);                   /* SSE                      */
>    n=nrow(x);                        /* sample size              */
>    dfe=nrow(x)-ncol(x);              /* error DF                 */
>    mse=sse/dfe;                      /* MSE                      */
>    cssy=ssq(y-sum(y)/n);             /* corrected total SS       */
>    rsquare=(cssy-sse)/cssy;          /* RSQUARE                  */
>    print,"Regression Results",
>          sse dfe mse rsquare;
>    stdb=sqrt(vecdiag(xpxi)*mse);     /* std of estimates         */
>    t=beta/stdb;                      /* parameter t tests        */
>    prob=1-probf(t#t,1,dfe);          /* p-values                 */
>    print,"Parameter Estimates",,
>          beta stdb t prob;
>    print,y yhat resid;
> finish regress;                      /* end module               */
Submit the module REGRESS for execution as follows:
> reset noprint;
> run regress;                         /* execute module           */

                   Regression Results

          SSE       DFE       MSE   RSQUARE
          6.4         2       3.2 0.9923518

                  Parameter Estimates

         BETA      STDB         T      PROB
          2.4 3.8366652 0.6255432 0.5954801
         -3.2 2.9237940 -1.094468 0.3879690
            2 0.4780914 4.1833001 0.0526691

            Y      YHAT     RESID
            1       1.2      -0.2
            5         4         1
            9      10.8      -1.8
           23      21.6       1.4
           36      36.4      -0.4
At this point, you still have all of the matrices defined if you want to continue calculations. Suppose you want to correlate the estimates. First, calculate the covariance matrix of the estimates; then scale it into a correlation matrix with values of 1 on the diagonal. You can perform these operations by using the following statements:
> reset print;                         /* turn on auto printing    */
> covb=xpxi*mse;                       /* covariance of estimates  */

          COVB          3 rows      3 cols    (numeric)

         14.72    -10.56       1.6
        -10.56 8.5485714 -1.371429
           1.6 -1.371429 0.2285714

> s=1/sqrt(vecdiag(covb));

          S             3 rows      1 col     (numeric)

      0.260643
     0.3420214
     2.0916501

> corrb=diag(s)*covb*diag(s);          /* correlation of estimates */

          CORRB         3 rows      3 cols    (numeric)

             1 -0.941376 0.8722784
     -0.941376         1 -0.981105
     0.8722784 -0.981105         1
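Incidentally, the REGRESS module works on the global matrices X and Y and leaves its results (XPXI, BETA, MSE, and so on) as global matrices, which is why they are still available for these calculations. IML modules can also be defined with an argument list, in which case all other symbols in the module are local. The following sketch is only an illustration of the syntax; the module name REGRESS2 is hypothetical:

> start regress2(beta,x,y);            /* arguments are passed by reference    */
>    beta=solve(x`*x,x`*y);            /* assigning to BETA returns it         */
> finish regress2;
> run regress2(b2,x,y);                /* call the module with arguments       */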
Your module REGRESS remains available to do another regression, in this case an orthogonalized version of the last polynomial example. In general, the columns of X will not be orthogonal. You can use the ORPOL function to generate orthogonal polynomials for the regression. Using them provides greater computing accuracy and reduced computing times. When using orthogonal polynomial regression, you expect the statistics of fit to be the same and the estimates to be more stable and uncorrelated.
To perform an orthogonal regression on the data, you must first create a vector x1 containing the values of the independent variable, which is the second column of the design matrix X. Then, use the ORPOL function to generate orthogonal second-degree polynomials. You can perform these operations by using the following statements:
> x1={1,2,3,4,5};                      /* second column of X       */

          X1            5 rows      1 col     (numeric)

                    1
                    2
                    3
                    4
                    5

> x=orpol(x1,2);                       /* generates orthogonal polynomials */

          X             5 rows      3 cols    (numeric)

     0.4472136 -0.632456 0.5345225
     0.4472136 -0.316228 -0.267261
     0.4472136         0 -0.534522
     0.4472136 0.3162278 -0.267261
     0.4472136 0.6324555 0.5345225

> reset noprint;                       /* turns off auto printing  */
> run regress;                         /* run REGRESS              */

                   Regression Results

          SSE       DFE       MSE   RSQUARE
          6.4         2       3.2 0.9923518

                  Parameter Estimates

         BETA      STDB         T      PROB
    33.093806 1.7888544      18.5 0.0029091
    27.828043 1.7888544 15.556349 0.0041068
    7.4833148 1.7888544 4.1833001 0.0526691

            Y      YHAT     RESID
            1       1.2      -0.2
            5         4         1
            9      10.8      -1.8
           23      21.6       1.4
           36      36.4      -0.4

> reset print;
> covb=xpxi*mse;

          COVB          3 rows      3 cols    (numeric)

           3.2 -2.73E-17 4.693E-16
     -2.73E-17       3.2 -2.18E-15
     4.693E-16 -2.18E-15       3.2

> s=1/sqrt(vecdiag(covb));

          S             3 rows      1 col     (numeric)

      0.559017
      0.559017
      0.559017
> corrb=diag(s)*covb*diag(s);

          CORRB         3 rows      3 cols    (numeric)

             1 -8.54E-18 1.467E-16
     -8.54E-18         1  -6.8E-16
     1.467E-16  -6.8E-16         1
Note that the values on the off-diagonal are displayed in scientific notation; the values are close to zero but not exactly zero because of the imprecision of floating-point arithmetic. To clean up the appearance of the correlation matrix, use the FUZZ option as follows:
> reset fuzz;
> corrb=diag(s)*covb*diag(s);

          CORRB         3 rows      3 cols    (numeric)

             1         0         0
             0         1         0
             0         0         1
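The uncorrelated estimates are no accident. ORPOL generates orthonormal columns, so for this design matrix X'X is the identity matrix and the covariance of the estimates reduces to MSE times the identity, as the COVB matrix above shows. If you want to verify the orthogonality directly, compute the crossproducts matrix yourself; with automatic printing and the FUZZ option still in effect, it should print as an identity matrix:

> xpx=x`*x;                            /* identity matrix for an ORPOL design  */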