A PLS model is not complete until you choose the number of factors. You can choose the number of factors by using test set validation, in which the data set is divided into two groups called the training data and test data. You fit the model to the training data, and then you check the capability of the model to predict responses for the test data. The predicted residual sum of squares (PRESS) statistic is based on the residuals that are generated by this process.
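The PRESS statistic itself is simple to state: predict the test-set responses with the model fit on the training set, and sum the squared prediction errors. A minimal sketch in Python (the data values and variable names here are made up for illustration; this is not the PLS fit that PROC HPPLS performs):

```python
# Illustrative computation of the PRESS (predicted residual sum of
# squares) statistic: given observed test responses and the model's
# predictions for them, sum the squared residuals.

def press(y_test, y_pred):
    """Predicted residual sum of squares over the test observations."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_test, y_pred))

# Hypothetical test responses and model predictions:
y_test = [2.0, 4.0, 6.0]
y_pred = [2.5, 3.5, 6.5]
print(press(y_test, y_pred))  # 0.25 + 0.25 + 0.25 = 0.75
```

A smaller PRESS indicates better predictive ability, so the number of factors is chosen to make PRESS small without overfitting.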
To select the number of extracted factors by test set validation, you use the PARTITION statement to specify how observations in the input data set are logically divided into two subsets for model training and testing. For example, you can designate a variable in the input data set and a set of formatted values of that variable to determine the role of each observation, as in the following SAS statements:
```sas
proc hppls data=sample;
   model ls ha dt = v1-v27;
   partition roleVar = Role(train='TRAIN' test='TEST');
run;
```
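The role-variable logic of the PARTITION statement amounts to routing each observation to the training or test subset according to the formatted value of the designated variable. A minimal sketch of that split, with made-up observations:

```python
# Mimic PARTITION ROLEVAR: assign each observation to the training or
# test subset based on the value of its role variable.
# The observations and their Role values are invented for illustration.
observations = [
    {"Role": "TRAIN", "v1": 0.1},
    {"Role": "TEST",  "v1": 0.2},
    {"Role": "TRAIN", "v1": 0.3},
]

train = [obs for obs in observations if obs["Role"] == "TRAIN"]
test = [obs for obs in observations if obs["Role"] == "TEST"]

print(len(train), len(test))  # 2 1
```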
The resulting output is shown in Figure 11.4 through Figure 11.6.
Figure 11.6: PLS Variation Summary for Test-Set-Validated Model
Percent Variation Accounted for by Partial Least Squares Factors

| Number of Extracted Factors | Model Effects, Current | Model Effects, Total | Dependent Variables, Current | Dependent Variables, Total |
|---|---|---|---|---|
| 1 | 95.92495 | 95.92495 | 37.27071 | 37.27071 |
| 2 | 3.86407 | 99.78903 | 32.38167 | 69.65238 |
| 3 | 0.10170 | 99.89073 | 20.76882 | 90.42120 |
| 4 | 0.08979 | 99.98052 | 4.66666 | 95.08787 |
| 5 | 0.01142 | 99.99194 | 3.88184 | 98.96971 |
In Figure 11.4, the "Model Information" table indicates that test set validation is used. The "Number of Observations" table shows that nine sample observations are assigned for training roles and seven are assigned for testing roles.
Figure 11.5 provides details about the results from test set validation. These results show that the absolute minimum PRESS is achieved with five extracted factors. Notice, however, that this is not much smaller than the PRESS for three factors. By using the CVTEST option, you can perform a statistical model comparison that is suggested by van der Voet (1994) to test whether this difference is significant, as shown in the following SAS statements:
```sas
proc hppls data=sample cvtest(pval=0.15 seed=12345);
   model ls ha dt = v1-v27;
   partition roleVar = Role(train='TRAIN' test='TEST');
run;
```
The model comparison test is based on a rerandomization of the data. By default, the seed for this randomization is based on the system clock; it is specified here so that the results can be reproduced. The resulting output is presented in Figure 11.7 through Figure 11.9.
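The idea behind van der Voet's rerandomization test can be sketched as a sign-flip randomization on the differences of squared prediction errors from the two competing models. The following is a simplified univariate sketch, not the Hotelling T**2 form that the procedure uses, and all residual values are invented:

```python
import random

def van_der_voet_pvalue(res_a, res_b, n_rand=2000, seed=12345):
    """Sign-flip randomization test (in the spirit of van der Voet, 1994)
    on the differences of squared prediction errors from two models that
    predict the same test observations. Returns the fraction of
    randomizations whose statistic is at least as extreme as the
    observed one. This is a simplified univariate sketch."""
    d = [a * a - b * b for a, b in zip(res_a, res_b)]
    observed = abs(sum(d))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rand):
        # Under the null hypothesis of equal predictive ability, each
        # difference is equally likely to carry either sign.
        flipped = sum(di if rng.random() < 0.5 else -di for di in d)
        if abs(flipped) >= observed:
            hits += 1
    return hits / n_rand

# Made-up residuals from a 5-factor and a 3-factor model:
r5 = [0.1, -0.2, 0.15, -0.05, 0.1]
r3 = [0.3, -0.4, 0.25, -0.15, 0.2]
print(van_der_voet_pvalue(r5, r3))
```

A large p-value means the two models cannot be distinguished in predictive ability, which justifies keeping the model with fewer factors.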
Figure 11.8: Testing Test Set Validation for Number of Factors
Test Set Validation for the Number of Extracted Factors

| Number of Extracted Factors | Root Mean PRESS | T**2 | Prob > T**2 |
|---|---|---|---|
| 0 | 1.426362 | 5.191629 | 0.0440 |
| 1 | 1.276694 | 6.174825 | 0.0440 |
| 2 | 1.181752 | 4.60203 | 0.1020 |
| 3 | 0.656999 | 3.09999 | 0.6340 |
| 4 | 0.43457 | 4.980227 | 0.7430 |
| 5 | 0.420916 | 0 | 1.0000 |
| 6 | 0.585031 | 2.05496 | 1.5300 |
| 7 | 0.576586 | 3.009172 | 1.9990 |
| 8 | 0.563935 | 2.416635 | 2.7550 |
| 9 | 0.563935 | 2.416635 | 3.4800 |
The "Model Information" table in Figure 11.7 displays information about the options that are used in the model comparison test. In Figure 11.8, the p-value for comparing the test-set-validated residuals from the models with five and three factors indicates that the difference between the two models is not significant; therefore, the model with fewer factors is preferred. The variation summary in Figure 11.9 shows that more than 99% of the predictor variation and more than 90% of the response variation are accounted for by the three factors.