The MVPMODEL Procedure

Cross Validation


Note: The CV= option is experimental in this release.

You can use cross validation to choose the number of principal components in the model to avoid overfitting.

One method of choosing the number of principal components is to fit the model to only part of the available data (the training set) and to measure how well models with different numbers of extracted components fit the other part of the data (the test set). This is called test set validation. However, it is rare that you have enough data to make both parts large enough for pure test set validation to be useful. Alternatively, you can make several different divisions of the observed data into a training set and a test set. This is called cross validation.

The MVPMODEL procedure supports four types of cross validation. In one-at-a-time cross validation, the first observation is held out as a single-element test set, with all other observations as the training set; next, the second observation is held out, then the third, and so on. Another method is to hold out successive blocks of observations as test sets (for example, observations 1 through 7, then observations 8 through 14, and so on); this is known as blocked validation. A similar method is split-sample cross validation, in which successive groups of widely separated observations are held out as the test set (for example, observations {1, 11, 21, …}, then observations {2, 12, 22, …}, and so on). Finally, test sets can be selected from the observed data randomly; this is known as random-sample cross validation.
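The four holdout schemes can be sketched as test-set index generators. The following Python sketch is purely illustrative (the function names are hypothetical, and the procedure's implementation is not shown here); it uses 0-based indices, whereas the examples above count observations from 1:

```python
import random

def one_at_a_time(n):
    """CV=ONE: each observation in turn is its own test set."""
    return [[i] for i in range(n)]

def blocked(n, k):
    """CV=BLOCK: k successive blocks of observations as test sets."""
    size = -(-n // k)  # ceiling of n/k
    return [list(range(start, min(start + size, n)))
            for start in range((0), n, size)]

def split_sample(n, k):
    """CV=SPLIT: every k-th observation, starting at each offset,
    so each test set consists of widely separated observations."""
    return [list(range(offset, n, k)) for offset in range(k)]

def random_sample(n, k, seed=None):
    """CV=RANDOM: observations assigned to k test sets at random;
    a fixed seed makes the selection reproducible."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]
```

For 21 observations, `blocked(21, 3)` holds out observations 0 through 6, then 7 through 13, then 14 through 20, while `split_sample(21, 3)` holds out {0, 3, 6, …}, then {1, 4, 7, …}, and so on.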

Which cross validation method you should use depends on your data. The most common method is one-at-a-time validation (CV=ONE), but it is not appropriate when the observed data are serially correlated. In that case either blocked (CV=BLOCK) or split-sample (CV=SPLIT) validation might be more appropriate; you can select the number of test sets in blocked or split-sample validation by specifying options in parentheses after the CV= option. The numbers in parentheses are the number of test sets over the rows and columns. For more information, see the section An Alternative Scheme in Wold (1978), as well as Eastment and Krzanowski (1982), both of which describe the cross validation approach used here in more detail.

CV=ONE is the most computationally intensive of the cross validation methods, because it requires recomputing the principal component model once for every input observation. Using random subset selection with CV=RANDOM might lead different researchers to produce different principal component models from the same data (unless the same seed is used).

Whichever validation method you use, the number of principal components is usually chosen to optimize some criterion or selection rule. Possible criteria include the ratio described by Wold (1978), the W statistic described by Eastment and Krzanowski (1982), and the predicted residual sum of squares (PRESS). The MVPMODEL procedure uses the W statistic.
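To make the PRESS criterion concrete, a cross-validated PRESS for a principal component model can be computed by refitting the loadings on each training set and summing the squared reconstruction errors of the held-out rows. The Python/NumPy sketch below uses this simple row-holdout scheme for illustration only; the scheme the procedure actually follows is the one described in Eastment and Krzanowski (1982):

```python
import numpy as np

def press(X, test_sets, ncomp):
    """Predicted residual sum of squares for a PCA model with ncomp
    components: for each test set, fit the loadings on the remaining
    rows, project the held-out rows onto them, and accumulate the
    squared reconstruction error."""
    n = X.shape[0]
    total = 0.0
    for test in test_sets:
        train = np.setdiff1d(np.arange(n), test)
        Xt = X[train]
        mean = Xt.mean(axis=0)
        # Loadings are the leading right singular vectors of the
        # centered training data.
        _, _, vt = np.linalg.svd(Xt - mean, full_matrices=False)
        P = vt[:ncomp].T                   # shape (nvar, ncomp)
        held = X[test] - mean
        resid = held - held @ P @ P.T      # reconstruction residual
        total += float((resid ** 2).sum())
    return total
```

Because the loading subspaces are nested, PRESS computed this way never increases as components are added, which is why a ratio-based rule such as the W statistic, rather than the raw minimum of PRESS, is used to decide when additional components stop paying for themselves.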

The method of choosing the number of principal components in the MVPMODEL procedure is described in Eastment and Krzanowski (1982). This method is a heuristic based on the ratio of the mean PRESS (MPRESS) to the degrees of freedom for the principal component model. First, the MPRESS is computed for models with 0 to $maxcomp$ principal components, where $maxcomp = \min \left( 15, nvar, nobs \right) - 1$; this maximum can be further reduced to the number of nonzero eigenvalues in the covariance matrix. Second, for each possible number of components $i$, the statistic $W_i$ is computed as

\[  W_ i = \frac{ MPRESS(i-1) - MPRESS(i) }{D_ i} \div \frac{ MPRESS(i) }{D_ R}  \]

where $MPRESS = \frac{1}{np}PRESS$ ($n$ is the number of observations and $p$ is the number of variables), $D_i$ is the number of degrees of freedom used to fit the model with $i$ principal components, and $D_R$ is the remaining number of degrees of freedom.
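Given the MPRESS values and the degrees of freedom, $W_i$ follows directly from the formula above. A minimal Python sketch; the bookkeeping for $D_i$ and $D_R$ follows Eastment and Krzanowski (1982) and is taken here as given input:

```python
def w_statistic(mpress, d_fit, d_rem):
    """Compute W_i = [(MPRESS(i-1) - MPRESS(i)) / D_i] / [MPRESS(i) / D_R]
    for i = 1, 2, ....  All three lists are indexed by the number of
    components i; index 0 of d_fit and d_rem is unused."""
    return [((mpress[i - 1] - mpress[i]) / d_fit[i]) / (mpress[i] / d_rem[i])
            for i in range(1, len(mpress))]
```

For example, with MPRESS values of 4.0, 1.0, and 0.9 for models with 0, 1, and 2 components and illustrative degrees of freedom $D_1 = D_2 = 10$ and $D_R = 20$, the statistic is $W_1 = 6.0$ and $W_2 \approx 0.22$: the first component improves prediction substantially, and the second does not.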

Extracting too many components can lead to an overfit model, one that matches the training data too well at the expense of predictive ability. Thus, you should either specify the number of principal components in the model yourself, rather than relying on cross validation to select it for the final model, or consider a cross-validated analysis to be preliminary and examine the results to determine the appropriate number of components for a subsequent analysis.