The HPPLS Procedure

Selecting the Number of Factors by Test Set Validation

A PLS model is not complete until you choose the number of factors. You can choose the number of factors by using test set validation, in which the data set is divided into two groups called the training data and test data. You fit the model to the training data, and then you check the capability of the model to predict responses for the test data. The predicted residual sum of squares (PRESS) statistic is based on the residuals that are generated by this process.
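As a rough illustration of the statistic, the following Python sketch computes PRESS and a root mean PRESS from held-out responses and their predictions. The function names and the sample numbers are hypothetical, and the scaling is an assumption: the procedure's own root mean PRESS may standardize the responses or average over several responses differently.

```python
import math

def press(actual, predicted):
    """Predicted residual sum of squares over the held-out (test) observations."""
    return sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))

def root_mean_press(actual, predicted):
    """Square root of PRESS divided by the number of test observations
    (assumed scaling; the procedure's definition may differ)."""
    return math.sqrt(press(actual, predicted) / len(actual))

# Hypothetical test-set responses and predictions (not taken from the sample data).
y_test = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.0, 5.0]

print(press(y_test, y_pred))  # prints 1.5
print(root_mean_press(y_test, y_pred))
```

A smaller value means that the model fitted to the training data predicts the test data more closely.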

To select the number of extracted factors by test set validation, you use the PARTITION statement to specify how observations in the input data set are logically divided into two subsets for model training and testing. For example, you can designate a variable in the input data set and a set of formatted values of that variable to determine the role of each observation, as in the following SAS statements:

proc hppls data=sample;
   model ls ha dt = v1-v27;
   partition roleVar = Role(train='TRAIN' test='TEST');
run;

The resulting output is shown in Figure 11.4 through Figure 11.6.

Figure 11.4: Model Information and Number of Observations with Test Set Validation

The HPPLS Procedure

Model Information
Data Source                 WORK.SAMPLE
Factor Extraction Method    Partial Least Squares
PLS Algorithm               NIPALS
Validation Method           Test Set Validation

Number of Observations Read                 16
Number of Observations Used                 16
Number of Observations Used for Training     9
Number of Observations Used for Testing      7



Figure 11.5: Test-Set-Validated PRESS Statistics for Number of Factors

The HPPLS Procedure

Test Set Validation for the Number of Extracted Factors

Number of Extracted Factors    Root Mean PRESS
0                              1.426362
1                              1.276694
2                              1.181752
3                              0.656999
4                              0.43457
5                              0.420916
6                              0.585031
7                              0.576586
8                              0.563935
9                              0.563935

Minimum Root Mean PRESS 0.420916
Minimizing Number of Factors 5



Figure 11.6: PLS Variation Summary for Test-Set-Validated Model

Percent Variation Accounted for by Partial Least Squares Factors

Number of            Model Effects           Dependent Variables
Extracted Factors    Current     Total       Current     Total
1                    95.92495    95.92495    37.27071    37.27071
2                     3.86407    99.78903    32.38167    69.65238
3                     0.10170    99.89073    20.76882    90.42120
4                     0.08979    99.98052     4.66666    95.08787
5                     0.01142    99.99194     3.88184    98.96971



In Figure 11.4, the "Model Information" table indicates that test set validation is used. The "Number of Observations" table shows that nine of the 16 sample observations are assigned to the training role and seven are assigned to the testing role.
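The role-based split can be pictured with a short Python sketch. The row layout and the function name are hypothetical, and which observations carry which role in the sample data is not stated here, so the 9/7 assignment below is made up purely to match the counts in Figure 11.4. In this sketch, rows whose role value matches neither label are left out of both subsets (an assumption).

```python
def partition_by_role(rows, role_key="Role", train="TRAIN", test="TEST"):
    """Split rows into training and test subsets by the value of a role variable."""
    training = [r for r in rows if r[role_key] == train]
    testing = [r for r in rows if r[role_key] == test]
    return training, testing

# Hypothetical 16-observation data set: the first nine rows train, the rest test.
rows = [{"obs": i, "Role": "TRAIN" if i <= 9 else "TEST"} for i in range(1, 17)]

training, testing = partition_by_role(rows)
print(len(training), len(testing))  # prints 9 7
```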

Figure 11.5 provides details about the results from test set validation. These results show that the absolute minimum root mean PRESS (0.420916) is achieved with five extracted factors. Notice, however, that a model with only three factors (root mean PRESS 0.656999) might not predict significantly worse. By using the CVTEST option, you can perform the statistical model comparison that is suggested by van der Voet (1994) to test whether this difference is significant, as shown in the following SAS statements:

proc hppls data=sample cvtest(pval=0.15 seed=12345);
   model ls ha dt = v1-v27;
   partition roleVar = Role(train='TRAIN' test='TEST');
run;

The model comparison test is based on a rerandomization of the data. By default, the seed for this randomization is based on the system clock, but here it is set explicitly (SEED=12345) so that the results are reproducible. The PVAL=0.15 option sets the cutoff probability for the test. The resulting output is presented in Figure 11.7 through Figure 11.9.
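The rerandomization idea can be illustrated with a simplified sign-flip test in Python. This is only a sketch under stated assumptions: it uses the absolute sum of the differences of squared residuals as its statistic instead of the T**2 statistic that the procedure reports, the function name and the sample residuals are hypothetical, and Python's random number generator does not reproduce the procedure's random stream for the same seed.

```python
import random

def randomization_p_value(resid_a, resid_b, n_perm=1000, seed=12345):
    """Sign-flip randomization test on differences of squared residuals.

    Simplified stand-in for the model comparison test: under the null
    hypothesis that two models predict equally well, each difference of
    squared residuals is equally likely to be positive or negative.
    """
    rng = random.Random(seed)
    diffs = [a * a - b * b for a, b in zip(resid_a, resid_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each difference and recompute the statistic.
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed:
            hits += 1
    return hits / n_perm

# Comparing a model with itself gives a statistic of 0 and a p-value of 1.
same = [0.5, -0.2, 0.1, 0.4]
print(randomization_p_value(same, same))  # prints 1.0
```

A large p-value means that the randomized statistics are often at least as extreme as the observed one, so the data give no evidence that the two models differ in predictive accuracy.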

Figure 11.7: Model Information with Model Comparison Test

The HPPLS Procedure

Model Information
Data Source                           WORK.SAMPLE
Factor Extraction Method              Partial Least Squares
PLS Algorithm                         NIPALS
Validation Method                     Test Set Validation
Validation Testing Criterion          Prob T**2 > 0.15
Number of Random Permutations         1000
Random Number Seed for Permutation    12345



Figure 11.8: Testing Test Set Validation for Number of Factors

The HPPLS Procedure

Test Set Validation for the Number of Extracted Factors

Number of Extracted Factors    Root Mean PRESS    T**2        Prob > T**2
0                              1.426362           5.191629    0.0440
1                              1.276694           6.174825    0.0440
2                              1.181752           4.60203     0.1020
3                              0.656999           3.09999     0.6340
4                              0.43457            4.980227    0.7430
5                              0.420916           0           1.0000
6                              0.585031           2.05496     1.5300
7                              0.576586           3.009172    1.9990
8                              0.563935           2.416635    2.7550
9                              0.563935           2.416635    3.4800

Minimum Root Mean PRESS 0.420916
Minimizing Number of Factors 5
Smallest Number of Factors with p > 0.15 3



Figure 11.9: PLS Variation Summary for Tested Test-Set-Validated Model

Percent Variation Accounted for by Partial Least Squares Factors

Number of            Model Effects           Dependent Variables
Extracted Factors    Current     Total       Current     Total
1                    95.92495    95.92495    37.27071    37.27071
2                     3.86407    99.78903    32.38167    69.65238
3                     0.10170    99.89073    20.76882    90.42120



The "Model Information" table in Figure 11.7 displays information about the options that are used in the model comparison test. In Figure 11.8, the p-value for comparing the test-set-validated residuals of the three-factor model with those of the five-factor model is 0.6340, which exceeds the 0.15 cutoff. The difference between the two models is therefore not statistically significant, and the model with fewer factors is preferred. The variation summary in Figure 11.9 shows that more than 99% of the predictor variation (99.89073%) and more than 90% of the response variation (90.42120%) are accounted for by the three factors.
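The selection rule described above can be sketched in a few lines of Python. The values are copied from rows 0 through 5 of Figure 11.8 (the candidates up to the minimizing number of factors); the function names are hypothetical, and the code only mimics the rule as it is reported in the output, not the procedure's internal implementation.

```python
# (number of factors, root mean PRESS, Prob > T**2) from rows 0-5 of Figure 11.8
figure_rows = [
    (0, 1.426362, 0.0440),
    (1, 1.276694, 0.0440),
    (2, 1.181752, 0.1020),
    (3, 0.656999, 0.6340),
    (4, 0.43457, 0.7430),
    (5, 0.420916, 1.0000),
]

def minimizing_factors(rows):
    """Number of factors with the smallest root mean PRESS."""
    return min(rows, key=lambda r: r[1])[0]

def smallest_insignificant(rows, pval=0.15):
    """Smallest number of factors whose p-value exceeds the cutoff, that is,
    the simplest model not significantly worse than the minimizing model."""
    best = minimizing_factors(rows)
    for factors, _, p in sorted(rows):
        if factors <= best and p > pval:
            return factors
    return best

print(minimizing_factors(figure_rows))      # prints 5
print(smallest_insignificant(figure_rows))  # prints 3
```

Raising the cutoff makes the rule more conservative about dropping factors; with a cutoff of 0.7, for example, the same rows would select four factors instead of three.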