In addition to using continuous and categorical (CLASS) predictors in a model, you can also define a regression spline transformation for a predictor. A spline can be used when the effect of a predictor requires a very flexible representation. A spline can be more flexible than adding polynomial terms (square, cube, etc.) to the model. Spline effects can be defined using the EFFECT statement that is available in several procedures. See the discussion of splines in the documentation of the EFFECT statement in the Shared Concepts and Topics chapter of the SAS/STAT® User's Guide and in this note.
Like the CLASS statement, the EFFECT statement creates a set of design variables that replaces the original variable in the model. The set of design variables for a spline effect is known as a spline expansion that is defined using a set of basis functions. Three types of spline bases for creating an expansion are available in the EFFECT statement: B-spline, truncated power function, and natural cubic bases. The default is the B-spline basis that creates an expansion consisting of n+d+1 design variables, where n is the number of internal knots,Note1 and d is the degree or order of the spline transformation. The knots are values of the original variable, between which functions are defined. The functions are joined at the knots. Hastie et. al. (2001) provide details on the construction and use of spline expansions.
The following example fits a model using a spline transformation of a predictor and then scores the original data using the model parameter estimates after computing values of the B-spline basis. This is scoring method 4 described in this note. Predicted values for any new set of observations can be obtained in this way. However, the recommended and much easier way to score new observations using a model involving splines is to use one of scoring methods 1 (use PROC PLM), 2 (use built-in scoring capabilities, except the CODE statement), or 3 (augmentation). Scoring method 1 using PROC PLM is also shown in this example. See the Extrapolation section below for important information about scoring outside the range of the original (training) data. For models involving a mixture of continuous, CLASS, and spline effects, scoring using the model parameter estimates can be done by constructing a data set containing all of the necessary design variables that the parameter estimates must multiply.
The following statements create the aerobic fitness data set introduced as an example in the documentation of the REG procedure.
data fitness; input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@; datalines; 44 89.47 44.609 11.37 62 178 182 40 75.07 45.313 10.07 62 185 185 44 85.84 54.297 8.65 45 156 168 42 68.15 59.571 8.17 40 166 172 38 89.02 49.874 9.22 55 178 180 47 77.45 44.811 11.63 58 176 176 40 75.98 45.681 11.95 70 176 180 43 81.19 49.091 10.85 64 162 170 44 81.42 39.442 13.08 63 174 176 38 81.87 60.055 8.63 48 170 186 44 73.03 50.541 10.13 45 168 168 45 87.66 37.388 14.03 56 186 192 45 66.45 44.754 11.12 51 176 176 47 79.15 47.273 10.60 47 162 164 54 83.12 51.855 10.33 50 166 170 49 81.42 49.156 8.95 44 180 185 51 69.63 40.836 10.95 57 168 172 51 77.91 46.672 10.00 48 162 168 48 91.63 46.774 10.25 48 162 164 49 73.37 50.388 10.08 67 168 168 57 73.37 39.407 12.63 58 174 176 54 79.38 46.080 11.17 62 156 165 52 76.32 45.441 9.63 48 164 166 50 70.87 54.625 8.92 48 146 155 51 67.25 45.118 11.08 48 172 172 54 91.63 39.203 12.88 44 168 172 51 73.71 45.790 10.47 59 186 188 57 59.08 50.545 9.93 49 148 155 49 76.32 48.673 9.40 56 186 188 48 61.24 47.920 11.50 52 170 176 52 82.78 47.467 10.50 53 170 172 ;
These statements fit an ordinary least squares regression model of the Oxygen response variable. The ORTHOREG procedure is used, rather than PROC REG, since it supports the EFFECT (and EFFECTPLOT) statement. The EFFECT statement defines a spline transformation of the Weight predictor, which is then used as the predictor in the model. Since no spline options are specified, an expansion using a B-spline basis of degree three with three equally-spaced interior knots is produced. Note that there is no generally optimal way to specify the number of knots or knot locations. However, increasing the number of knots increases the flexibility of the spline and further localizes the behavior of the functions defined between the knots. A spline of degree three using three knots creates 3+3+1 = 7 basis functions. The EFFECTPLOT statement produces a plot of the fitted spline. The STORE statement saves the fitted model to be used in PROC PLM to obtain predicted values. The OUTEST= option is specified to save the parameter estimates of the model. The saved parameter estimates are later used to recreate the predicted values after first computing the B-spline basis function values.
proc orthoreg data=fitness outest=RegParms; effect sWeight=spline(Weight / details); model oxygen=sWeight; effectplot / obs nolimits; store out=model; run;
Following are the default knots used in defining the spline basis functions and the parameter estimates of the fitted model. The value in using a spline transformation is the flexibility it provides in modeling the response. This flexibility comes at the cost of largely removing any meaningful interpretation of the individual parameter estimates. Note the flexibility of the fitted spline in the plot from the EFFECTPLOT statement.
|
These statements score the original data set and save the predicted values in variable PLMpred in data set PLMscore. These predicted values are later compared to those computed using the parameter estimates of the model and the B-spline basis functions that are reproduced below.
proc plm source=model; score data=fitness out=PLMscore predicted=PLMpred; run;
These statements reproduce the B-spline basis functions that the EFFECT statement generates.Note2 Only the %LET statements at the beginning need to be altered for new examples. They specify the data set to be scored (PLMscore), the variable to which the spline transformation is applied (Weight), the number of internal knots (3), and the degree of the spline (3). Note that the PLMscore data set from PROC PLM is a copy of the original data set, FITNESS, plus the predicted response values. Either data set could be used, but using PLMscore allows the predicted values to be carried forward for comparison later. The range of the Weight variable is determined by PROC MEANS. The KNOTS DATA step determines the locations of the equally-spaced knots.Note3 The B-spline basis function values are then computed in the Bsplines DATA step. The basis function design variables are named b1, b2, ... .
%let var=Weight; %let data=PLMscore; %let nintknots=3; %let degree=3; proc means data=&data noprint; var &var; output out=minmax min=min max=max; run; %let nknots=%eval(&nintknots+2*(°ree+1)); data knots; /* KNOTMETHOD=EQUAL(&nintknots) */ set minmax; knotdiff=(max-min)/(&nintknots+1); array knot (&nknots) k1-k&nknots; do i=1 to &nknots; knot(i)=min+(i-°ree-1)*knotdiff; end; keep k1-k&nknots; run; data Bsplines; if _n_=1 then set knots; set &data; array knot (&nknots) k1-k&nknots; array bsp (%eval(&nknots-1)) b1-b%eval(&nknots-1); do o=1 to °ree+1; do i=1 to &nknots-o; if o=1 then bsp(i)=(knot(i)<=&var<knot(i+1)); else do; d1=(knot(i+o-1)-knot(i)); d2=(knot(i+o)-knot(i+1)); if d1=0 then tmp1=0; else tmp1=bsp(i)*(&var-knot(i))/d1; if d2=0 then tmp2=0; else tmp2=bsp(i+1)*(knot(i+o)-&var)/d2; bsp(i)=tmp1 + tmp2; end; end; end; drop o i d1 d2 tmp1 tmp2 b%eval(&nintknots+°ree+2)-b%eval(&nknots-1); run;
Scoring of the data is done in the following step. The vector of parameter estimates from the ORTHOREG OUTEST= option is read along with the data set of B-spline basis function values (Bsplines). Note that different modeling procedures use different names for the variables containing the parameter estimates.Note5 PROC ORTHOREG uses the names COL1, COL2, ... . The linear predictor (LP) is computed as usual by multiplying parameter estimates by corresponding basis function values and summing them with the intercept. For an ordinary regression model like this, the predicted value (PRED) is the linear predictor.Note4
data Score; if _n_=1 then set RegParms; set Bsplines; lp=intercept + col1*b1 + col2*b2 + col3*b3 + col4*b4 + col5*b5 + col6*b6 + col7*b7; Pred=lp; run;
These statements display the first ten observations of the Score data set showing that the predicted values produced by PROC PLM (PLMpred) are identical to the values produced by the above code (Pred).
proc print data=Score(obs=10) noobs; var PLMpred Pred; run;
|
Extrapolation
There are important considerations any time you predict values outside the range of the data used to train the model. This is known as extrapolation. For any regression model, whether it contains splines or not, extrapolation should be done with great caution. Since there is no information outside the range of the data, the shape of the model in that region is very uncertain. Simply extending the model beyond the data could result in very unreliable predictions. This warning can be made even more strongly for models involving splines. Since splines are used for their flexibility, the model can result in extreme predictions very quickly outside the data range. A way to lessen this anomalous behavior is to include the range of prediction in the selection of knots for the spline basis. However, doing so effectively changes the model from the original trained model.
To illustrate these issues in the example above, suppose you want to make predictions at weights between 10 and 150. Note that Weight only ranges from about 60 to 90 in the data. Predictions using the above model can be done using PROC PLM. These statements create a data set of records in this expanded range to be scored.
data newdata; do Weight=10 to 150; output; end; run;
This PLM step scores the data:
proc plm source=model; score data=newdata out=Bscores; run;
The following plot shows how each spline basis behaves when extrapolating outside the training data range. Notice how the predictions quickly go to extremes except in the case of the natural cubic spline basis which assumes linearity outside of the outer knots.
Note that the same predictions can also be obtained using the model parameter estimates and the computed basis function values rather than with PROC PLM. Predictions can be computed as above except that the knots of the original fit would be specified rather than recomputed.
You might obtain more reasonable behavior outside, but close to, the data range by including the expanded range of prediction when selecting knots that define the spline basis. This is easily done by adding the new data to the original and refitting the model to the augmented data. The following statements add observations with Weight values from 45 to 100 to the original data. Note that even though the response is missing for these added observations, the observations are still used by the modeling procedure for the purpose of knot selection. This is slightly different from the usual modeling procedure behavior of entirely ignoring observations when the response is missing. The EFFECTPLOT statement plots the predictions across the expanded range.
data newdata; do Weight=45 to 100; output; end; run; data augmented; set newdata fitness; run; proc orthoreg data=augmented; effect sWeight=spline(Weight / details); model oxygen=sWeight; effectplot / obs nolimits; run;
Notice that this is a different model than originally with different knots, different parameters, and a clearly different shape. In fact, changing the range of the added data, and therefore the knots, can dramatically alter the shape of the fitted curve.
|
_________________
Notes
Note 1: Only knots produced by the KNOTMETHOD=EQUAL spline option are shown. For knots specified by the KNOTMETHOD=LIST or LISTWITHBOUNDARY spline option, replace the KNOTS DATA step with a step that specifies the knots in variables K1, K2, ... .
Note 2: The following presents similar code to reproduce truncated power function and natural cubic spline basis functions. Note that appropriate changes would be needed in the DATA SCORE step if scoring is done.
The following code reproduces the truncated power function (TPF) spline basis functions created by this EFFECT statement: effect sWeight=spline(Weight / basis=tpf);
%let var=Weight; %let data=PLMscore; %let nknots=3; %let degree=3; proc means data=&data noprint; var &var; output out=minmax min=min max=max; run; data knots; /* KNOTMETHOD=EQUAL(&nknots) */ set minmax; knotdiff=(max-min)/(&nknots+1); array knot (&nknots) k1-k&nknots; do i=1 to &nknots; knot(i)=min+i*knotdiff; end; keep k1-k&nknots; run; data TPFsplines; if _n_=1 then set knots; set &data; array knot (&nknots) k1-k&nknots; array t (%eval(&nknots+°ree+1)) tpf1-tpf%eval(&nknots+°ree+1); do i=0 to °ree; t(i+1)=&var**i; end; do i=1 to &nknots; t(°ree+1+i)=max(0,&var-knot(i))**°ree; end; drop i; run;
The following code reproduces the natural cubic spline basis functions created by this EFFECT statement: effect sWeight=spline(Weight / naturalcubic);
%let var=Weight; %let data=PLMscore; %let nknots=3; proc means data=&data noprint; var &var; output out=minmax min=min max=max; run; data knots; /* KNOTMETHOD=EQUAL(&nknots) */ set minmax; knotdiff=(max-min)/(&nknots+1); array knot (&nknots) k1-k&nknots; do i=1 to &nknots; knot(i)=min+i*knotdiff; end; keep k1-k&nknots; run; data NCsplines; if _n_=1 then set knots; set &data; array knot (&nknots) k1-k&nknots; array nc (&nknots) nc1-nc&nknots; array t (%eval(&nknots+4)) tpf1-tpf%eval(&nknots+4); nc1=1; nc2=&var; do i=1 to &nknots; t(i+4)=max(0,&var-knot(i))**3; end; do i=1 to &nknots-2; nc(i+2)=(t(i+4)-t(&nknots+4))/(knot(&nknots)-knot(i)) - (t(&nknots+3)-t(&nknots+4))/(knot(&nknots)-knot(&nknots-1)); end; drop tpf: i; run;
Note 3: If the DATABOUNDARY spline option is specified with BASIS=BSPLINE, the following DATA step should be used to create the KNOTS data set.
data knots; /* KNOTMETHOD=EQUAL(&nintknots) DATABOUNDARY */ set minmax; knotdiff=(max-min)/(&nintknots+1); array knot (&nknots) k1-k%eval(&nknots); do i=1 to &nknots; knot(i)=min(max(min,min+(i-°ree-1)*knotdiff),max); end; keep k1-k%eval(&nknots); run;
Note 4: For generalized linear models like logistic or Poisson models, the inverse of the link function must be applied to the linear predictor to obtain the predicted value on the mean scale. For such models, replace
pred=lp;
in the DATA SCORE step with one of the following. For a generalized linear model with binomial distribution and logit link (a logistic model), specify
pred=logistic(lp);
For a binomial model with probit link, specify
pred=probnorm(lp);
For a log-linked model, such as often used with the Poisson, negative binomial, and gamma distributions, specify
pred=exp(lp);
Note5: For modeling procedures that do not offer an OUTEST= option for saving parameter estimates, use an ODS OUTPUT statement instead. Adding the following statement in the modeling step saves the table of parameter estimates to data set PE:
ods output ParameterEstimates=pe;
In order to use the saved parameter estimates in the DATA SCORE step, transpose the column of estimates in the PE data set. For the example above, these statements transpose the Estimate column and use the estimate names in the Parameter column to name the columns in the transposed data set, TPE. The DATA SCORE step can then do the scoring by using the TPE data set and the column names for the parameters.
proc transpose data=pe out=tpe; var estimate; id parameter; run; data Score; if _n_=1 then set tpe; set Bsplines; lp=intercept + sWeight_1*b1 + sWeight_2*b2 + sWeight_3*b3 + sWeight_4*b4 + sWeight_5*b5 + sWeight_6*b6 + sWeight_7*b7; Pred=lp; run;
_________________
Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.
Product Family | Product | System | SAS Release | |
Reported | Fixed* | |||
SAS System | SAS/STAT | z/OS | ||
z/OS 64-bit | ||||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
OS/2 | ||||
Microsoft Windows 8 Enterprise 32-bit | ||||
Microsoft Windows 8 Enterprise x64 | ||||
Microsoft Windows 8 Pro 32-bit | ||||
Microsoft Windows 8 Pro x64 | ||||
Microsoft Windows 8.1 Enterprise 32-bit | ||||
Microsoft Windows 8.1 Enterprise x64 | ||||
Microsoft Windows 8.1 Pro | ||||
Microsoft Windows 8.1 Pro 32-bit | ||||
Microsoft Windows 10 | ||||
Microsoft Windows 95/98 | ||||
Microsoft Windows 2000 Advanced Server | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows Server 2003 for x64 | ||||
Microsoft Windows Server 2008 | ||||
Microsoft Windows Server 2008 R2 | ||||
Microsoft Windows Server 2008 for x64 | ||||
Microsoft Windows Server 2012 Datacenter | ||||
Microsoft Windows Server 2012 R2 Datacenter | ||||
Microsoft Windows Server 2012 R2 Std | ||||
Microsoft Windows Server 2012 Std | ||||
Microsoft Windows XP Professional | ||||
Windows 7 Enterprise 32 bit | ||||
Windows 7 Enterprise x64 | ||||
Windows 7 Home Premium 32 bit | ||||
Windows 7 Home Premium x64 | ||||
Windows 7 Professional 32 bit | ||||
Windows 7 Professional x64 | ||||
Windows 7 Ultimate 32 bit | ||||
Windows 7 Ultimate x64 | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
Windows Vista for x64 | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX |
Type: | Usage Note |
Priority: | |
Topic: | SAS Reference ==> Procedures ==> GLIMMIX SAS Reference ==> Procedures ==> GLMSELECT SAS Reference ==> Procedures ==> HPMIXED SAS Reference ==> Procedures ==> LOGISTIC SAS Reference ==> Procedures ==> ORTHOREG SAS Reference ==> Procedures ==> PHREG SAS Reference ==> Procedures ==> PLS SAS Reference ==> Procedures ==> QUANTLIFE SAS Reference ==> Procedures ==> QUANTREG SAS Reference ==> Procedures ==> QUANTSELECT SAS Reference ==> Procedures ==> ROBUSTREG SAS Reference ==> Procedures ==> SURVEYLOGISTIC SAS Reference ==> Procedures ==> SURVEYREG Analytics ==> Regression Analytics ==> Transformations |
Date Modified: | 2018-07-23 11:26:53 |
Date Created: | 2016-02-17 15:18:22 |