The QUANTSELECT Procedure

Example 96.4 Surface Fitting with Many Noisy Variables

This example is based on the example "Surface Fitting with Many Noisy Variables" in Chapter 25: The ADAPTIVEREG Procedure. It shows how you can use the EFFECT statement to select a nonlinear surface model for a data set that contains many nuisance variables.

Consider a simulated data set that contains a response variable and 10 continuous predictors. Each predictor is sampled independently from the uniform distribution $U(0,1)$. The true model of the artificial data set depends nonlinearly on only two of the predictors, x1 and x2:

$y = \frac{40\exp \left(8\left((x_1-0.5)^2+(x_2-0.5)^2\right)\right)}{\exp \left(8\left((x_1-0.2)^2+(x_2-0.7)^2\right)\right)+ \exp \left(8\left((x_1-0.7)^2+(x_2-0.2)^2\right)\right)}$

The values of the response variable are generated by adding errors from the standard normal distribution $N(0,1)$ to the true model. The generating mechanism is adapted from Gu et al. (1990). The following statements create an artificial data set that contains 400 observations for effect selection and 10,201 observations that have missing response values for prediction. The noisy training response is stored in yTrain, and the noise-free surface on the prediction grid is stored in y:

%let p=10;
data artificial;
   drop i j;
   array x{&p};
   do i=1 to 400;
      do j=1 to &p;
         x{j} = ranuni(1);
      end;
      yTrain = 40*exp(8*((x1-0.5)**2+(x2-0.5)**2))/
          (exp(8*((x1-0.2)**2+(x2-0.7)**2))+
          exp(8*((x1-0.7)**2+(x2-0.2)**2)))+rannor(1);
      output;
   end;

   yTrain = .;
   do x1=0 to 1 by 0.01;
      do x2 = 0 to 1 by 0.01;
         y = 40*exp(8*((x1-0.5)**2+(x2-0.5)**2))/
             (exp(8*((x1-0.2)**2+(x2-0.7)**2))+
             exp(8*((x1-0.7)**2+(x2-0.2)**2)));
         output;
      end;
   end;
run;
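Before you fit the model, it can help to confirm that the data set has the layout described above. The following statements are a minimal sketch of such a check; they are not part of the original example. They assume only the data set Artificial and the variable names that the preceding DATA step creates:

```sas
/* Sanity check (illustrative sketch, not part of the original example). */
/* yTrain should be nonmissing only in the 400 training observations,    */
/* and y should be nonmissing only in the prediction-grid observations.  */
proc means data=artificial n nmiss min max;
   var yTrain y x1 x2;
run;
```

If the DATA step ran as described, the N column should show 400 for yTrain and 10,201 for y, and the minimum and maximum of x1 and x2 should lie in the interval from 0 to 1.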

The variables x3 through x10 are nuisance variables that can cause overfitting in your analysis. The following statements invoke the QUANTSELECT procedure to select effects, fit a model on the selected effects, and write the model predictions to the output data set Out:

%macro art;
   proc quantselect data=artificial algorithm=smooth;
      %do i=1 %to &p;
         effect sp&i = spline(x&i);
      %end;
      model yTrain =
         sp1 %do i=2 %to &p; |sp&i %end; @2/details=all;
      output out=Out p=pred;
   run;
%mend;

%art;

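You can use the EFFECT statement to generate nonlinear effects and model a nonlinear surface. This example constructs a spline effect for each of the 10 predictors. The bar operator together with the @2 option then offers the 10 spline main effects and all 45 of their two-way interactions as candidates for selection.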
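To make the macro logic concrete, the following statements are a hand-written expansion of what the %ART macro is expected to generate when p=10. This is a sketch for illustration only; it was written by unrolling the two %DO loops manually, not copied from a SAS log:

```sas
/* Hand-expanded equivalent of %art for p=10 (illustrative sketch). */
proc quantselect data=artificial algorithm=smooth;
   /* One spline effect per predictor */
   effect sp1  = spline(x1);
   effect sp2  = spline(x2);
   effect sp3  = spline(x3);
   effect sp4  = spline(x4);
   effect sp5  = spline(x5);
   effect sp6  = spline(x6);
   effect sp7  = spline(x7);
   effect sp8  = spline(x8);
   effect sp9  = spline(x9);
   effect sp10 = spline(x10);
   /* Bar operator with @2: main effects plus two-way interactions */
   model yTrain = sp1|sp2|sp3|sp4|sp5|sp6|sp7|sp8|sp9|sp10 @2 / details=all;
   output out=Out p=pred;
run;
```

To see the statements that the macro actually generates, you can submit options mprint; before you call %ART and then inspect the SAS log. If each spline effect uses a cubic B-spline basis with three interior knots (which is assumed here to be the EFFECT statement default), each effect contributes seven columns, so an interaction such as sp1*sp2 contributes 7 x 7 = 49 columns.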

The ALGORITHM=SMOOTH option specifies the smoothing algorithm for model fitting. It takes approximately 2.8 seconds to select the model on a PC that has an Intel i7-2600 quad-core CPU and the 64-bit Windows 7 Enterprise operating system. If you use the ALGORITHM=SIMPLEX option, which is the default, the same computation takes approximately 8.7 seconds.

Output 96.4.1 shows the model information. By default, the effect selection method is the stepwise method, and SBC is the criterion for the SELECT=, CHOOSE=, and STOP= options. The default quantile level is 0.5, which corresponds to median regression.

Output 96.4.1: Model Information

The QUANTSELECT Procedure

Model Information
Data Set	WORK.ARTIFICIAL
Dependent Variable	yTrain
Selection Method	Stepwise
Quantile Type	Single Level
Select Criterion	SBC
Stop Criterion	SBC
Choose Criterion	SBC
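The defaults that are reported in the Model Information table can also be written out explicitly, which makes it easier to change one of them later. The following statements are a sketch under the assumption that the QUANTILE= and SELECTION= options in the MODEL statement accept the values shown. The macro name artExplicit and the data set name OutExplicit are hypothetical names introduced only for this illustration:

```sas
/* Same analysis with the default settings spelled out (sketch).       */
/* artExplicit and OutExplicit are hypothetical names for illustration. */
%macro artExplicit;
   proc quantselect data=artificial algorithm=smooth;
      %do i=1 %to &p;
         effect sp&i = spline(x&i);
      %end;
      model yTrain =
         sp1 %do i=2 %to &p; |sp&i %end; @2 /
         quantile=0.5                                        /* median regression     */
         selection=stepwise(select=sbc stop=sbc choose=sbc)  /* criteria made explicit */
         details=all;
      output out=OutExplicit p=pred;
   run;
%mend;

%artExplicit;
```

If these settings match the defaults, the Model Information table and the selected model should be the same as in the original run.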

Output 96.4.2 shows the best 10 entry candidates at step 1 of the selection. Because smaller SBC values are better, you can see that sp1*sp2 is by far the strongest candidate, followed by sp1 and sp2.

Output 96.4.2: Best 10 Entry Candidates at Step 1

Best 10 Entry Candidates
Rank	Effect	SBC
1	sp1*sp2	-496.6752
2	sp1	165.9104
3	sp2	178.2126
4	sp3	213.4593
5	sp6	220.8471
6	sp7	222.0916
7	sp9	224.3185
8	sp4	224.7100
9	sp8	226.8373
10	sp5	227.2176

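Output 96.4.3 shows the selection summary. The selection stops after step 1, so the selected model contains only the intercept and sp1*sp2, and none of the nuisance variables enter the model.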

Output 96.4.3: Selection Summary

The QUANTSELECT Procedure

Quantile Level = 0.5

Selection Summary
Step	Effect Entered	Number Effects In	Number Parms In	SBC
0	Intercept	1	1	195.8108
1	sp1*sp2	2	49	-496.6752*
* Optimal Value Of Criterion
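The SBC values in the selection summary can be checked by hand. The following is a worked sketch that assumes the quantile-regression form of SBC is built from the average check loss; the symbol $\mathrm{ACL}$ is notation introduced here, and the formula is stated as an assumption, not quoted from the procedure documentation:

$\rho_\tau(r) = r\left(\tau - I(r<0)\right), \qquad \mathrm{ACL}(\tau) = \frac{1}{n}\sum_{i=1}^{n} \rho_\tau\left(y_i - \hat{y}_i\right), \qquad \mathrm{SBC}(\tau) = 2n\ln \mathrm{ACL}(\tau) + p\ln n$

With $n=400$ training observations, $\ln n \approx 5.9915$. Solving for the average check loss at each step gives:

$\text{Step 0 } (p=1): \quad \ln \mathrm{ACL} = \frac{195.8108 - 5.9915}{800} \approx 0.2373, \quad \mathrm{ACL} \approx 1.268$

$\text{Step 1 } (p=49): \quad \ln \mathrm{ACL} = \frac{-496.6752 - 49(5.9915)}{800} \approx -0.9878, \quad \mathrm{ACL} \approx 0.372$

Under this assumed formula, adding sp1*sp2 reduces the average check loss from about 1.27 to about 0.37. For comparison, the expected check loss at $\tau=0.5$ for pure $N(0,1)$ noise is $0.5\sqrt{2/\pi} \approx 0.399$, so the step 1 value is close to what you would expect if the fitted surface captures essentially all of the signal.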

The following statements produce a graph that shows both the true model and the fitted model:

ods graphics on;
data pred;
   set out;
   where yTrain=.;
run;

%let off0 = offsetmin=0 offsetmax=0;
%let off0 = xaxisopts=(&off0) yaxisopts=(&off0);
%let eopt = location=outside valign=top textattrs=graphlabeltext;
proc template;
   define statgraph surfaces;
      begingraph / designheight=360px;
         layout lattice/columns=2;
            layout overlay / &off0;
               entry "True Model" / &eopt;
               contourplotparm z=y y=x2 x=x1;
            endlayout;
            layout overlay / &off0;
               entry "Fitted Model" / &eopt;
               contourplotparm z=pred y=x2 x=x1;
            endlayout;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=pred template=surfaces;
run;

Output 96.4.4 displays surfaces for both the true model and the fitted model. You can see that the fitted model nicely approximates the underlying true model.

Output 96.4.4: True Model and Fitted Model
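The visual comparison can be supplemented by a numeric summary. The following statements are a minimal sketch that is not part of the original example. They assume the data set Pred that is created earlier, which contains the noise-free surface y and the fitted values pred on the prediction grid. The data set name predErr and the variable name absErr are hypothetical names introduced for this illustration:

```sas
/* Numeric summary of the fit on the prediction grid (sketch).   */
/* predErr and absErr are hypothetical names for illustration.   */
data predErr;
   set pred;
   absErr = abs(pred - y);   /* absolute error against the true surface */
run;

proc means data=predErr n mean median max;
   var absErr;
run;
```

A small mean absolute error relative to the range of y would support the conclusion that you draw from the contour plots.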