Input Data Sets

The MVPDIAGNOSE procedure accepts a primary input data set that has one of the following two types:

  • a DATA= data set that contains new process data to be analyzed by using an existing principal component model (Phase II analysis)

  • a HISTORY= data set that contains process data and the accompanying scores, residuals, and statistics that are produced by using a principal component model. The process data can be the original data that were used to create the model (Phase I analysis) or subsequent data that were analyzed by using a previously created model (Phase II analysis).

The DATA= and HISTORY= options are mutually exclusive. If you specify neither option, PROC MVPDIAGNOSE uses the most recently created SAS data set as a DATA= data set.

When you specify a DATA= data set, you must also specify a LOADINGS= data set that contains principal component loadings and other information that describes the principal component model. When you specify a HISTORY= data set, you must also specify a LOADINGS= data set if you use a CONTRIBUTIONPANEL or CONTRIBUTIONPLOT statement and specify the TYPE=TSQUARE option.
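The following statements are a minimal sketch of the HISTORY= case, not a definitive program. The data set names (Work.Process, Work.MvpHist, Work.MvpLoad), the process variables x1-x5, and the choice of two principal components are hypothetical. The sketch assumes that the model is built by PROC MVPMODEL with the OUT= and OUTLOADINGS= options, and it specifies LOADINGS= because the CONTRIBUTIONPANEL statement uses TYPE=TSQUARE.

```sas
/* Hypothetical names: Work.Process, x1-x5, Work.MvpHist, Work.MvpLoad */
ods graphics on;

/* Build the principal component model; save the history and loadings */
proc mvpmodel data=Work.Process ncomp=2
              out=Work.MvpHist outloadings=Work.MvpLoad;
   var x1-x5;
run;

/* Phase I diagnosis: HISTORY= input, with LOADINGS= for TYPE=TSQUARE */
proc mvpdiagnose history=Work.MvpHist loadings=Work.MvpLoad;
   contributionpanel / type=tsquare;
run;
```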

DATA= Data Set

A DATA= data set provides the process measurement data for a Phase II analysis. In addition to containing the process variables, a DATA= data set can contain the following:

  • BY variables

  • ID variables

  • a TIME variable

When you specify a DATA= data set, you must also specify a LOADINGS= data set that contains the loadings for the principal component model that describes the variation of the process. These loadings are used to score the new data from the DATA= data set. The process variables in the LOADINGS= data set must have the same names as those in the DATA= data set.
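As an illustration of a Phase II analysis, the following sketch scores new data against a previously saved model. The data set names (Work.NewProcess, Work.MvpLoad) and the time variable hour are hypothetical, and the sketch assumes that Work.NewProcess contains process variables with the same names as those in the LOADINGS= data set.

```sas
/* Hypothetical names: Work.NewProcess, Work.MvpLoad, hour */
ods graphics on;

/* Phase II diagnosis: new data are scored by using the saved loadings */
proc mvpdiagnose data=Work.NewProcess loadings=Work.MvpLoad;
   time hour;                          /* TIME variable in the DATA= data set */
   contributionpanel / type=tsquare;
run;
```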


HISTORY= Data Set

A HISTORY= data set provides the input data set for a Phase I or Phase II analysis. In addition to containing the original process variables, a HISTORY= data set contains principal component scores, residuals, SPE and $T^2$ statistics, and a count of the observations that are used to construct the principal component model. These variables are summarized in Table 11.5.

Table 11.5: Variables in the HISTORY= Data Set

  • principal component scores

  • number of observations used in the analysis

  • squared prediction error (SPE) statistic

  • $T^2$ statistic computed from principal component scores

The score variable names must consist of a common prefix followed by the numbers 1, 2, …, j, where j is the number of principal components. By default, the common prefix is Prin. You can use the PREFIX= option to specify another prefix for score variables.

If the number of principal components is less than the total number of process variables, the HISTORY= data set should also contain residual variables. A residual variable name consists of a common prefix followed by the corresponding process variable name. The default residual variable prefix is R_. For example, if the process variables are A, B, and C, the default residual variable names are R_A, R_B, and R_C. You can use the RPREFIX= option to specify a different residual variable prefix.

Note: Usually you create a HISTORY= data set by specifying the OUT= option in the PROC MVPMODEL statement or the OUTHISTORY= option in the PROC MVPMONITOR statement. If the PREFIX= or RPREFIX= option is used when such an output data set is created, you must specify the same prefixes to identify the score and residual variables when you read it as a HISTORY= data set.
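The following sketch shows how the prefixes carry over from model building to diagnosis. All names are hypothetical (Work.Process, x1-x5, Work.MvpHist2, Work.MvpLoad2, and the prefixes PC and Res_), and the sketch assumes that PREFIX= and RPREFIX= are given in both PROC statements.

```sas
/* Hypothetical names: Work.Process, x1-x5, Work.MvpHist2, Work.MvpLoad2 */
ods graphics on;

/* Scores are saved as PC1, PC2; residuals as Res_x1, ..., Res_x5 */
proc mvpmodel data=Work.Process ncomp=2
              prefix=PC rprefix=Res_
              out=Work.MvpHist2 outloadings=Work.MvpLoad2;
   var x1-x5;
run;

/* Repeat the same prefixes so that the score and residual variables are identified */
proc mvpdiagnose history=Work.MvpHist2 loadings=Work.MvpLoad2
                 prefix=PC rprefix=Res_;
   contributionpanel / type=tsquare;
run;
```

If the prefixes are omitted in the second step, the procedure looks for the default names (Prin1, Prin2, … and R_x1, …), which this HISTORY= data set does not contain.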


LOADINGS= Data Set

A LOADINGS= data set contains the following information about the principal component model:

  • eigenvalues of the correlation or covariance matrix used to construct the model

  • principal component loadings

  • process variable means used to center the variable values

  • process variable standard deviations used to scale the variable values

You can produce a LOADINGS= data set by using the OUTLOADINGS= option in the PROC MVPMODEL statement. Table 11.6 lists the variables that are required in a LOADINGS= data set.

Table 11.6: Variables in the LOADINGS= Data Set

  • the _VALUE_ variable, which identifies the value contained in the process variables for a given observation

  • the number of observations used to build the principal component model

  • the principal component number; 0 for the observation that contains eigenvalues

  • the process variables, which contain the values associated with the process variables

Valid values for the _VALUE_ variable are as follows:

  • EIGEN: eigenvalues from the principal component analysis

  • LOADING: principal component loadings

  • MEAN: process variable means

  • STD: process variable standard deviations

The LOADINGS= data set contains one EIGEN observation and j LOADING observations, where j is the number of principal components in the model. The presence of a MEAN observation indicates that the process variables were centered when the principal component model was built, and the presence of a STD observation indicates that the process variables were scaled when the principal component model was built. The means and standard deviations are used to center and scale new data in a Phase II analysis.
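One simple way to check this layout is to print the data set and look at the _VALUE_ column. The data set name Work.MvpLoad is hypothetical; the sketch assumes that it was created by the OUTLOADINGS= option in PROC MVPMODEL.

```sas
/* Hypothetical name: Work.MvpLoad, created by OUTLOADINGS= in PROC MVPMODEL */
/* List every observation to see which _VALUE_ rows are present */
proc print data=Work.MvpLoad noobs;
run;
```

If the listing has no MEAN or STD observation, the process variables were not centered or scaled, respectively, when the model was built.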