SUPPORT / SAMPLES & SAS NOTES
 

Support

Usage Note 57682: Scoring a model containing spline effects

DetailsAboutRate It

In addition to using continuous and categorical (CLASS) predictors in a model, you can also define a regression spline transformation for a predictor. A spline can be used when the effect of a predictor requires a very flexible representation. A spline can be more flexible than adding polynomial terms (square, cube, etc.) to the model. Spline effects can be defined using the EFFECT statement that is available in several procedures. See the discussion of splines in the documentation of the EFFECT statement in the Shared Concepts and Topics chapter of the SAS/STAT® User's Guide and in this note.

Like the CLASS statement, the EFFECT statement creates a set of design variables that replaces the original variable in the model. The set of design variables for a spline effect is known as a spline expansion that is defined using a set of basis functions. Three types of spline bases for creating an expansion are available in the EFFECT statement: B-spline, truncated power function, and natural cubic bases. The default is the B-spline basis that creates an expansion consisting of n+d+1 design variables, where n is the number of internal knots,Note1 and d is the degree or order of the spline transformation. The knots are values of the original variable, between which functions are defined. The functions are joined at the knots. Hastie et. al. (2001) provide details on the construction and use of spline expansions.

The following example fits a model using a spline transformation of a predictor and then scores the original data using the model parameter estimates after computing values of the B-spline basis. This is scoring method 4 described in this note. Predicted values for any new set of observations can be obtained in this way. However, the recommended and much easier way to score new observations using a model involving splines is to use one of scoring methods 1 (use PROC PLM), 2 (use built-in scoring capabilities, except the CODE statement), or 3 (augmentation). Scoring method 1 using PROC PLM is also shown in this example. See the Extrapolation section below for important information about scoring outside the range of the original (training) data. For models involving a mixture of continuous, CLASS, and spline effects, scoring using the model parameter estimates can be done by constructing a data set containing all of the necessary design variables that the parameter estimates must multiply.

The following statements create the aerobic fitness data set introduced as an example in the documentation of the REG procedure.

      data fitness;
         input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
         datalines;
      44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
      44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
      38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
      40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
      44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
      44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
      45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
      54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
      51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
      48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
      57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
      52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
      51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
      51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
      49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
      52 82.78 47.467 10.50 53 170 172
      ;

These statements fit an ordinary least squares regression model of the Oxygen response variable. The ORTHOREG procedure is used, rather than PROC REG, since it supports the EFFECT (and EFFECTPLOT) statement. The EFFECT statement defines a spline transformation of the Weight predictor, which is then used as the predictor in the model. Since no spline options are specified, an expansion using a B-spline basis of degree three with three equally-spaced interior knots is produced. Note that there is no generally optimal way to specify the number of knots or knot locations. However, increasing the number of knots increases the flexibility of the spline and further localizes the behavior of the functions defined between the knots. A spline of degree three using three knots creates 3+3+1 = 7 basis functions. The EFFECTPLOT statement produces a plot of the fitted spline. The STORE statement saves the fitted model to be used in PROC PLM to obtain predicted values. The OUTEST= option is specified to save the parameter estimates of the model. The saved parameter estimates are later used to recreate the predicted values after first computing the B-spline basis function values.

      proc orthoreg data=fitness outest=RegParms;
        effect sWeight=spline(Weight / details);
        model oxygen=sWeight;
        effectplot / obs nolimits;
        store out=model;
        run;

Following are the default knots used in defining the spline basis functions and the parameter estimates of the fitted model. The value in using a spline transformation is the flexibility it provides in modeling the response. This flexibility comes at the cost of largely removing any meaningful interpretation of the individual parameter estimates. Note the flexibility of the fitted spline in the plot from the EFFECTPLOT statement.

Knots for Spline Effect sWeight
Knot Number Boundary Weight
1 * 42.80500
2 * 50.94250
3 * 59.08000
4   67.21750
5   75.35500
6   83.49250
7 * 91.63000
8 * 99.76750
9 * 107.90500
 
Parameter DF Parameter Estimate Standard
Error
t Value Pr > |t|
Intercept 1 47.2672241399077 52.137189006 0.91 0.3736
sWeight 1 1 76.6706881423887 97.493166744 0.79 0.4393
sWeight 2 1 -15.1294883745276 58.000993554 -0.26 0.7964
sWeight 3 1 8.36518540265234 51.050148052 0.16 0.8712
sWeight 4 1 -6.23613378688783 54.051515052 -0.12 0.9091
sWeight 5 1 7.39727387448752 49.207131031 0.15 0.8818
sWeight 6 1 -8.06090359125194 62.338460584 -0.13 0.8982
sWeight 7 0 0 . . .

Fit Plot for Oxygen by Weight

These statements score the original data set and save the predicted values in variable PLMpred in data set PLMscore. These predicted values are later compared to those computed using the parameter estimates of the model and the B-spline basis functions that are reproduced below.

      proc plm source=model;
        score data=fitness out=PLMscore predicted=PLMpred;
        run;

These statements reproduce the B-spline basis functions that the EFFECT statement generates.Note2 Only the %LET statements at the beginning need to be altered for new examples. They specify the data set to be scored (PLMscore), the variable to which the spline transformation is applied (Weight), the number of internal knots (3), and the degree of the spline (3). Note that the PLMscore data set from PROC PLM is a copy of the original data set, FITNESS, plus the predicted response values. Either data set could be used, but using PLMscore allows the predicted values to be carried forward for comparison later. The range of the Weight variable is determined by PROC MEANS. The KNOTS DATA step determines the locations of the equally-spaced knots.Note3 The B-spline basis function values are then computed in the Bsplines DATA step. The basis function design variables are named b1, b2, ... .

      %let var=Weight;
      %let data=PLMscore;
      %let nintknots=3;
      %let degree=3;

      proc means data=&data noprint; 
        var &var; 
        output out=minmax min=min max=max; 
        run;
      
      %let nknots=%eval(&nintknots+2*(&degree+1));      
      data knots; /* KNOTMETHOD=EQUAL(&nintknots) */
        set minmax;
        knotdiff=(max-min)/(&nintknots+1);
        array knot (&nknots) k1-k&nknots;
        do i=1 to &nknots;  
          knot(i)=min+(i-&degree-1)*knotdiff;
        end;
        keep k1-k&nknots;
        run;
      
      data Bsplines;
        if _n_=1 then set knots;
        set &data;
        array knot (&nknots) k1-k&nknots;
        array bsp (%eval(&nknots-1)) b1-b%eval(&nknots-1); 
        do o=1 to &degree+1;
          do i=1 to &nknots-o;
            if o=1 then bsp(i)=(knot(i)<=&var<knot(i+1));
            else do;
              d1=(knot(i+o-1)-knot(i));
              d2=(knot(i+o)-knot(i+1));
              if d1=0 then tmp1=0; else tmp1=bsp(i)*(&var-knot(i))/d1;
              if d2=0 then tmp2=0; else tmp2=bsp(i+1)*(knot(i+o)-&var)/d2;
              bsp(i)=tmp1 + tmp2;
            end;
          end;
        end;
        drop o i d1 d2 tmp1 tmp2 b%eval(&nintknots+&degree+2)-b%eval(&nknots-1);
        run;

Scoring of the data is done in the following step. The vector of parameter estimates from the ORTHOREG OUTEST= option is read along with the data set of B-spline basis function values (Bsplines). Note that different modeling procedures use different names for the variables containing the parameter estimates.Note5 PROC ORTHOREG uses the names COL1, COL2, ... . The linear predictor (LP) is computed as usual by multiplying parameter estimates by corresponding basis function values and summing them with the intercept. For an ordinary regression model like this, the predicted value (PRED) is the linear predictor.Note4

      data Score;
        if _n_=1 then set RegParms;
        set Bsplines;
        lp=intercept + col1*b1 + col2*b2 + col3*b3 + col4*b4 + col5*b5 + col6*b6 + col7*b7;
        Pred=lp;
        run;

These statements display the first ten observations of the Score data set showing that the predicted values produced by PROC PLM (PLMpred) are identical to the values produced by the above code (Pred).

      proc print data=Score(obs=10) noobs;
        var PLMpred Pred;
        run;
PLMpred Pred
44.7725 44.7725
45.7707 45.7707
48.5527 48.5527
49.5591 49.5591
45.2329 45.2329
46.3850 46.3850
45.7787 45.7787
49.1260 49.1260
49.2626 49.2626
49.4953 49.4953

Extrapolation

There are important considerations any time you predict values outside the range of the data used to train the model. This is known as extrapolation. For any regression model, whether it contains splines or not, extrapolation should be done with great caution. Since there is no information outside the range of the data, the shape of the model in that region is very uncertain. Simply extending the model beyond the data could result in very unreliable predictions. This warning can be made even more strongly for models involving splines. Since splines are used for their flexibility, the model can result in extreme predictions very quickly outside the data range. A way to lessen this anomalous behavior is to include the range of prediction in the selection of knots for the spline basis. However, doing so effectively changes the model from the original trained model.

To illustrate these issues in the example above, suppose you want to make predictions at weights between 10 and 150. Note that Weight only ranges from about 60 to 90 in the data. Predictions using the above model can be done using PROC PLM. These statements create a data set of records in this expanded range to be scored.

    data newdata; 
        do Weight=10 to 150; 
          output;
        end; 
        run;

This PLM step scores the data:

      proc plm source=model;
        score data=newdata out=Bscores;
        run;

The following plot shows how each spline basis behaves when extrapolating outside the training data range. Notice how the predictions quickly go to extremes except in the case of the natural cubic spline basis which assumes linearity outside of the outer knots.

Extrapolation with 3 spline types

Note that the same predictions can also be obtained using the model parameter estimates and the computed basis function values rather than with PROC PLM. Predictions can be computed as above except that the knots of the original fit would be specified rather than recomputed.

You might obtain more reasonable behavior outside, but close to, the data range by including the expanded range of prediction when selecting knots that define the spline basis. This is easily done by adding the new data to the original and refitting the model to the augmented data. The following statements add observations with Weight values from 45 to 100 to the original data. Note that even though the response is missing for these added observations, the observations are still used by the modeling procedure for the purpose of knot selection. This is slightly different from the usual modeling procedure behavior of entirely ignoring observations when the response is missing. The EFFECTPLOT statement plots the predictions across the expanded range.

      data newdata; 
        do Weight=45 to 100; 
          output;
        end; 
        run;
      data augmented;
        set newdata fitness;
        run;
      proc orthoreg data=augmented;
        effect sWeight=spline(Weight / details);
        model oxygen=sWeight;
        effectplot / obs nolimits;
        run;

Notice that this is a different model than originally with different knots, different parameters, and a clearly different shape. In fact, changing the range of the added data, and therefore the knots, can dramatically alter the shape of the fitted curve.

Knots for Spline Effect sWeight
Knot Number Boundary Weight
1 * 17.50000
2 * 31.25000
3 * 45.00000
4   58.75000
5   72.50000
6   86.25000
7 * 100.00000
8 * 113.75000
9 * 127.50000
 
Parameter DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1 166.831271440004 933.22217222 0.18 0.8596
sWeight 1 0 0 . . .
sWeight 2 1 -152.804110216564 959.1570797 -0.16 0.8747
sWeight 3 1 -107.590331151236 927.46992946 -0.12 0.9086
sWeight 4 1 -125.887558044974 937.03357428 -0.13 0.8942
sWeight 5 1 -110.934309373573 927.43078731 -0.12 0.9057
sWeight 6 1 -143.454903205023 958.37129694 -0.15 0.8822
sWeight 7 0 0 . . .

Extrapolation with new knots

_________________

Notes

Note 1: Only knots produced by the KNOTMETHOD=EQUAL spline option are shown. For knots specified by the KNOTMETHOD=LIST or LISTWITHBOUNDARY spline option, replace the KNOTS DATA step with a step that specifies the knots in variables K1, K2, ... .

Note 2: The following presents similar code to reproduce truncated power function and natural cubic spline basis functions. Note that appropriate changes would be needed in the DATA SCORE step if scoring is done.

The following code reproduces the truncated power function (TPF) spline basis functions created by this EFFECT statement: effect sWeight=spline(Weight / basis=tpf);

      %let var=Weight;
      %let data=PLMscore;
      %let nknots=3;
      %let degree=3;
      
      proc means data=&data noprint; 
      var &var; 
      output out=minmax min=min max=max; 
      run;
      
      data knots; /* KNOTMETHOD=EQUAL(&nknots) */
      set minmax;
      knotdiff=(max-min)/(&nknots+1);
      array knot (&nknots) k1-k&nknots;
      do i=1 to &nknots;  
        knot(i)=min+i*knotdiff;
      end;
      keep k1-k&nknots;
      run;
      
      data TPFsplines;
      if _n_=1 then set knots;
      set &data;
      array knot (&nknots) k1-k&nknots;
      array t (%eval(&nknots+&degree+1)) tpf1-tpf%eval(&nknots+&degree+1);
      do i=0 to &degree;
        t(i+1)=&var**i;
      end;
      do i=1 to &nknots;
        t(&degree+1+i)=max(0,&var-knot(i))**&degree;
      end;
      drop i;
      run;

The following code reproduces the natural cubic spline basis functions created by this EFFECT statement: effect sWeight=spline(Weight / naturalcubic);

      %let var=Weight;
      %let data=PLMscore;
      %let nknots=3;
      
      proc means data=&data noprint; 
      var &var; 
      output out=minmax min=min max=max; 
      run;
      
      data knots; /* KNOTMETHOD=EQUAL(&nknots) */
      set minmax;
      knotdiff=(max-min)/(&nknots+1);
      array knot (&nknots) k1-k&nknots;
      do i=1 to &nknots;  
        knot(i)=min+i*knotdiff;
      end;
      keep k1-k&nknots;
      run;
      
      data NCsplines;
      if _n_=1 then set knots;
      set &data;
      array knot (&nknots) k1-k&nknots;
      array nc (&nknots) nc1-nc&nknots;
      array t (%eval(&nknots+4)) tpf1-tpf%eval(&nknots+4);
      nc1=1;
      nc2=&var;
      do i=1 to &nknots;
        t(i+4)=max(0,&var-knot(i))**3;
      end;
      do i=1 to &nknots-2;
        nc(i+2)=(t(i+4)-t(&nknots+4))/(knot(&nknots)-knot(i)) - 
                (t(&nknots+3)-t(&nknots+4))/(knot(&nknots)-knot(&nknots-1));
      end;
      drop tpf: i;
      run;

Note 3: If the DATABOUNDARY spline option is specified with BASIS=BSPLINE, the following DATA step should be used to create the KNOTS data set.

      data knots; /* KNOTMETHOD=EQUAL(&nintknots) DATABOUNDARY */
        set minmax;
        knotdiff=(max-min)/(&nintknots+1);
          array knot (&nknots) k1-k%eval(&nknots);
        do i=1 to &nknots;  
          knot(i)=min(max(min,min+(i-&degree-1)*knotdiff),max);
        end;
        keep k1-k%eval(&nknots);
        run;

Note 4: For generalized linear models like logistic or Poisson models, the inverse of the link function must be applied to the linear predictor to obtain the predicted value on the mean scale. For such models, replace

pred=lp;

in the DATA SCORE step with one of the following. For a generalized linear model with binomial distribution and logit link (a logistic model), specify

pred=logistic(lp);

For a binomial model with probit link, specify

pred=probnorm(lp);

For a log-linked model, such as often used with the Poisson, negative binomial, and gamma distributions, specify

pred=exp(lp);

Note5: For modeling procedures that do not offer an OUTEST= option for saving parameter estimates, use an ODS OUTPUT statement instead. Adding the following statement in the modeling step saves the table of parameter estimates to data set PE:

        ods output ParameterEstimates=pe;

In order to use the saved parameter estimates in the DATA SCORE step, transpose the column of estimates in the PE data set. For the example above, these statements transpose the Estimate column and use the estimate names in the Parameter column to name the columns in the transposed data set, TPE. The DATA SCORE step can then do the scoring by using the TPE data set and the column names for the parameters.

      proc transpose data=pe out=tpe;
        var estimate; id parameter;
        run;
      data Score;
        if _n_=1 then set tpe;
        set Bsplines;
        lp=intercept + sWeight_1*b1 + sWeight_2*b2 + sWeight_3*b3 + sWeight_4*b4 + sWeight_5*b5 + sWeight_6*b6 + sWeight_7*b7;
        Pred=lp;
        run;

_________________

Reference

Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.



Operating System and Release Information

Product FamilyProductSystemSAS Release
ReportedFixed*
SAS SystemSAS/STATz/OS
z/OS 64-bit
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 8 Enterprise 32-bit
Microsoft Windows 8 Enterprise x64
Microsoft Windows 8 Pro 32-bit
Microsoft Windows 8 Pro x64
Microsoft Windows 8.1 Enterprise 32-bit
Microsoft Windows 8.1 Enterprise x64
Microsoft Windows 8.1 Pro
Microsoft Windows 8.1 Pro 32-bit
Microsoft Windows 10
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2003 for x64
Microsoft Windows Server 2008
Microsoft Windows Server 2008 R2
Microsoft Windows Server 2008 for x64
Microsoft Windows Server 2012 Datacenter
Microsoft Windows Server 2012 R2 Datacenter
Microsoft Windows Server 2012 R2 Std
Microsoft Windows Server 2012 Std
Microsoft Windows XP Professional
Windows 7 Enterprise 32 bit
Windows 7 Enterprise x64
Windows 7 Home Premium 32 bit
Windows 7 Home Premium x64
Windows 7 Professional 32 bit
Windows 7 Professional x64
Windows 7 Ultimate 32 bit
Windows 7 Ultimate x64
Windows Millennium Edition (Me)
Windows Vista
Windows Vista for x64
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.