PROC MVPMODEL: Cross Validation

Cross validation is a method for choosing the number of components in a model to avoid overfitting. The most common technique is one-at-a-time validation (CV=ONE) unless the observed data are serially correlated. If the data are serially correlated, either blocked or split-sample validation might be more appropriate (CV=BLOCK or CV=SPLIT); you can specify the number of test sets in blocked or split-sample validation with a number in parentheses after the CV= option. CV=ONE is the most computationally intensive of the cross validation methods, since it requires a recomputation of the principal components model for every input observation. Using random subset selection with CV=RANDOM might lead two different researchers to produce different principal components models on the same data (unless the same seed is used).
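To make the difference between the validation schemes concrete, the following Python sketch builds the held-out test sets for each one. It is an illustration only, not the procedure's implementation: the function names are hypothetical, and it assumes that blocked validation holds out contiguous runs of observations while split-sample validation holds out every n-th observation. The exact partitioning that PROC MVPMODEL uses might differ.

```python
import random


def one_at_a_time(n_obs):
    """One test set per observation, so the model is refit n_obs times."""
    return [[i] for i in range(n_obs)]


def blocked(n_obs, n_sets):
    """Assumed scheme: n_sets test sets of contiguous observations."""
    bounds = [(j * n_obs) // n_sets for j in range(n_sets + 1)]
    return [list(range(bounds[j], bounds[j + 1])) for j in range(n_sets)]


def split_sample(n_obs, n_sets):
    """Assumed scheme: test set j holds observations j, j + n_sets, ..."""
    return [list(range(j, n_obs, n_sets)) for j in range(n_sets)]


def random_subsets(n_obs, n_test, n_iter, seed):
    """Random test sets; a fixed seed makes the selection reproducible."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_obs), n_test)) for _ in range(n_iter)]


if __name__ == "__main__":
    print(blocked(6, 3))        # contiguous blocks
    print(split_sample(6, 3))   # interleaved observations
    print(random_subsets(6, 2, 3, seed=1) == random_subsets(6, 2, 3, seed=1))
```

With serially correlated data, a single held-out observation has close neighbors left in the training data, which is why the blocked and split-sample schemes might be preferred in that case. The last line shows why two analyses agree only when they share a seed.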

Whichever validation method you use, the number of principal components chosen is usually the one that minimizes the predicted residual sum of squares (PRESS); this is the default choice if you specify any of the CV methods with the MVPMODEL procedure. However, often models with fewer principal components have PRESS statistics that are only marginally larger than the absolute minimum.
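As a small illustration of that last point, the following Python sketch finds the number of components that minimizes PRESS and also lists the models whose PRESS is within a relative tolerance of the minimum. The PRESS values are made up for the example, and the function names and the `tolerance` parameter are hypothetical; they are not options of the MVPMODEL procedure.

```python
def min_press_components(press):
    """Number of components with the smallest PRESS.

    `press` maps a number of components to its PRESS statistic.
    """
    return min(press, key=press.get)


def near_minimum(press, tolerance=0.05):
    """Numbers of components whose PRESS is within `tolerance`
    (relative) of the absolute minimum, in increasing order."""
    best = min(press.values())
    return sorted(k for k, v in press.items() if v <= best * (1 + tolerance))


if __name__ == "__main__":
    # Illustrative PRESS values only, not output from any real analysis.
    press = {1: 120.0, 2: 81.0, 3: 80.0, 4: 86.0}
    print(min_press_components(press))
    print(near_minimum(press))
```

In this made-up example the three-component model has the smallest PRESS, but the two-component model is only marginally worse, so a more parsimonious model might be worth considering.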

The method of choosing the number of principal components is described in Wold (1978). The method is a heuristic based on the ratio of the PRESS to the sum of squared errors (SSE). More specifically, it is a type of forward selection. In what follows, $\text{PRESS}_k$ and $\text{SSE}_k$ denote the PRESS and the SSE from a model with $k$ principal components. First, a null model with zero principal components is constructed, and its SSE, $\text{SSE}_0$, is computed. Second, the PRESS for a model with one component, $\text{PRESS}_1$, is computed. Then the ratio $\text{PRESS}_1 / \text{SSE}_0$ is computed; this quantity is called Wold’s ratio. If Wold’s ratio is less than 1, then the predictions are improved by the inclusion of a principal component. The process of computing Wold’s ratios, $\text{PRESS}_{k+1} / \text{SSE}_k$, continues until the predictions are not improved by adding more components to the model, a condition where Wold’s ratio is greater than 1.
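The forward selection described above can be sketched in a few lines of Python. This is a minimal illustration of the stopping rule only, assuming the SSE and PRESS values have already been computed; the function name and the input layout are hypothetical, and the numbers are made up. The source does not say what happens when the ratio equals exactly 1, so the sketch treats that case as "no improvement" and stops.

```python
def select_components(sse, press):
    """Forward selection on the ratio of PRESS to SSE.

    sse[k]   : SSE of the model with k components (k = 0 is the null model).
    press[k] : PRESS of the model with k components; press[0] is unused.
    Returns the number of components kept.
    """
    chosen = 0
    for k in range(len(press) - 1):
        ratio = press[k + 1] / sse[k]
        if ratio >= 1.0:
            # Adding component k + 1 does not improve the predictions.
            break
        chosen = k + 1
    return chosen


if __name__ == "__main__":
    # Illustrative values only.
    sse = [100.0, 60.0, 40.0, 35.0]
    press = [None, 70.0, 50.0, 45.0, 40.0]
    # Ratios: 70/100 = 0.70, 50/60 = 0.83, 45/40 = 1.125, so selection
    # stops before the third component is added.
    print(select_components(sse, press))
```

Because the selection stops at the first ratio that fails the test, later components are never examined even if they would individually give a ratio below 1.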

Extracting too many components can lead to an overfit model, one that matches the training data too well at the expense of predictive ability. Thus, if you specify the number of principal components in the model, you should either use cross validation to select the appropriate number of components for the final model, or consider the analysis to be preliminary and examine the results to determine the appropriate number of components for a subsequent analysis.

Note: This procedure is experimental.