The MVPDIAGNOSE Procedure

Input Data Sets

The MVPDIAGNOSE procedure accepts one primary input data set, which must be one of the following two types:

a DATA= data set that contains new process data to be analyzed by using an existing principal component model (Phase II analysis)
a HISTORY= data set that contains process data and the accompanying scores, residuals, and statistics that are produced by using a principal component model. The process data can be the original data that were used to create the model (Phase I analysis) or subsequent data that were analyzed by using a previously created model (Phase II analysis)

The DATA= and HISTORY= options are mutually exclusive. If you specify neither option, PROC MVPDIAGNOSE uses the most recently created SAS data set as a DATA= data set.

When you specify a DATA= data set, you must also specify a LOADINGS= data set that contains principal component loadings and other information that describes the principal component model. When you specify a HISTORY= data set, you must also specify a LOADINGS= data set if you use a CONTRIBUTIONPANEL or CONTRIBUTIONPLOT statement and specify the TYPE=TSQUARE option.
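
The rules above can be restated as a small check. The following Python sketch is only an illustration of the logic described in this section, not part of the procedure: the function name `check_inputs` and its parameters are hypothetical, and the `tsquare_contributions` flag stands for "a CONTRIBUTIONPANEL or CONTRIBUTIONPLOT statement with TYPE=TSQUARE is used".

```python
def check_inputs(data=None, history=None, loadings=None,
                 tsquare_contributions=False):
    """Return a list of problems with a hypothetical set of input options."""
    problems = []
    if data is not None and history is not None:
        problems.append("DATA= and HISTORY= are mutually exclusive")
    # With no HISTORY= data set, the primary input is a DATA= data set:
    # either the one specified or the most recently created data set.
    uses_data = history is None
    if uses_data and loadings is None:
        problems.append("a DATA= data set requires LOADINGS=")
    if history is not None and tsquare_contributions and loadings is None:
        problems.append("TYPE=TSQUARE contributions require LOADINGS=")
    return problems


# A Phase II run on new data with a saved model passes the check.
ok = check_inputs(data="newdata", loadings="model_loadings")
```

An empty list means that the combination is consistent with the rules in this section; each string in a nonempty list names one violated rule.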

DATA= Data Set

A DATA= data set provides the process measurement data for a Phase II analysis. In addition to containing the process variables, a DATA= data set can contain the following:

BY variables
ID variables
a TIME variable

When you specify a DATA= data set, you must also specify a LOADINGS= data set that contains the loadings for the principal component model that describes the variation of the process. These loadings are used to score the new data from the DATA= data set. The process variables in the LOADINGS= data set must have the same names as those in the DATA= data set.
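
As a rough illustration of what scoring new data with saved loadings involves, the following Python sketch centers and scales one observation, projects it onto the loading vectors, and takes the residual as what the retained components do not reproduce. This is the textbook principal component calculation, assumed here for illustration; the function name `score_observation` and the sample numbers are hypothetical, and the sketch leaves residuals on the centered and scaled scale, which this section does not specify.

```python
def score_observation(x, loadings, means=None, stds=None):
    """Project one observation onto principal component loadings.

    x        : list of p process-variable values
    loadings : list of j loading vectors, each of length p
    means    : optional list of p means used for centering
    stds     : optional list of p standard deviations used for scaling
    """
    z = list(x)
    if means is not None:
        z = [v - m for v, m in zip(z, means)]
    if stds is not None:
        z = [v / s for v, s in zip(z, stds)]
    # Each score is the inner product of the prepared data with a loading vector.
    scores = [sum(w * v for w, v in zip(pc, z)) for pc in loadings]
    # Reconstruct from the retained components; the remainder is the residual.
    fitted = [sum(scores[k] * loadings[k][i] for k in range(len(loadings)))
              for i in range(len(z))]
    residuals = [v - f for v, f in zip(z, fitted)]
    return scores, residuals


# Made-up example: two process variables, one retained component.
scores, residuals = score_observation(
    [3.0, 5.0], [[0.6, 0.8]], means=[1.0, 1.0], stds=[2.0, 2.0])
```

The requirement that process variables have the same names in both data sets corresponds here to `x`, `means`, `stds`, and each loading vector listing the variables in the same order.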

HISTORY= Data Set

A HISTORY= data set provides the input data for a Phase I or Phase II analysis. In addition to containing the original process variables, a HISTORY= data set contains principal component scores, residuals, SPE and $T^2$ statistics, and a count of the observations that were used to construct the principal component model. These variables are summarized in Table 11.5.

Table 11.5: Variables in the HISTORY= Data Set

Variable	Description
Prin1–Prin$j$	Principal component scores
R_$var1$–R_$varp$	Residuals
_NOBS_	Number of observations used in the analysis
_SPE_	Squared prediction error (SPE) statistic
_TSQUARE_	$T^2$ statistic computed from principal component scores
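
To make the relationship between the score, residual, and statistic variables concrete, the following Python sketch computes the two statistics from one observation's scores and residuals. This section does not give the formulas, so the sketch assumes the usual textbook definitions: SPE is the sum of squared residuals, and $T^2$ is the sum of squared scores, each divided by the eigenvalue (score variance) of its component. The function names and numbers are hypothetical.

```python
def spe(residuals):
    """Squared prediction error: sum of squared residuals (assumed definition)."""
    return sum(r * r for r in residuals)


def t_square(scores, eigenvalues):
    """T-squared: squared scores scaled by their eigenvalues (assumed definition)."""
    return sum(t * t / lam for t, lam in zip(scores, eigenvalues))


# Made-up values for one observation with one retained component.
spe_value = spe([-0.32, 0.24])
tsq_value = t_square([2.2], [1.21])
```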

The score variable names must consist of a common prefix followed by the numbers 1, 2, …, j, where j is the number of principal components. By default, the common prefix is Prin. You can use the PREFIX= option to specify another prefix for score variables.

If the number of principal components is less than the total number of process variables, the HISTORY= data set should also contain residual variables. A residual variable name consists of a common prefix followed by the corresponding process variable name. The default residual variable prefix is R_. For example, if the process variables are A, B, and C, the default residual variable names are R_A, R_B, and R_C. You can use the RPREFIX= option to specify a different residual variable prefix.
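
The naming convention for score and residual variables can be sketched in a few lines of Python. The function `history_variable_names` is a hypothetical helper written for this illustration; its defaults mirror the default prefixes Prin and R_ described above.

```python
def history_variable_names(process_vars, n_components,
                           prefix="Prin", rprefix="R_"):
    """Build the expected score and residual variable names."""
    score_names = [f"{prefix}{k}" for k in range(1, n_components + 1)]
    # Residual variables are expected only when fewer components than
    # process variables are retained.
    if n_components < len(process_vars):
        residual_names = [f"{rprefix}{v}" for v in process_vars]
    else:
        residual_names = []
    return score_names, residual_names


score_names, residual_names = history_variable_names(["A", "B", "C"], 2)
# score_names    -> ['Prin1', 'Prin2']
# residual_names -> ['R_A', 'R_B', 'R_C']
```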

Note: Usually you create a HISTORY= data set by specifying the OUT= option in the PROC MVPMODEL statement or the OUTHISTORY= option in the PROC MVPMONITOR statement. If the PREFIX= or RPREFIX= option is used when such an output data set is created, you must specify the same prefixes to identify the score and residual variables when you read it as a HISTORY= data set.

LOADINGS= Data Set

A LOADINGS= data set contains the following information about the principal component model:

eigenvalues of the correlation or covariance matrix used to construct the model
principal component loadings
process variable means used to center the variable values
process variable standard deviations used to scale the variable values

You can produce a LOADINGS= data set by using the OUTLOADINGS= option in the PROC MVPMODEL statement. Table 11.6 lists the variables that are required in a LOADINGS= data set.

Table 11.6: Variables in the LOADINGS= Data Set

Variable	Description
_VALUE_	Type of value that the process variables contain in a given observation
_NOBS_	Number of observations used to build the principal component model
_PC_	The principal component number; 0 for the observation that contains eigenvalues
process variables	Values associated with the process variables

Valid values for the _VALUE_ variable are as follows:

EIGEN: eigenvalues from the principal component analysis
LOADING: principal component loadings
MEAN: process variable means
STD: process variable standard deviations

The LOADINGS= data set contains one EIGEN observation and j LOADING observations, where j is the number of principal components in the model. The presence of a MEAN observation indicates that the process variables were centered when the principal component model was built, and the presence of a STD observation indicates that the process variables were scaled when the principal component model was built. The means and standard deviations are used to center and scale new data in a Phase II analysis.
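
The layout described above can be pictured as a small table with one row per observation. The following Python sketch represents such a table as a list of dictionaries and reads off the model properties that the text describes. All numbers are made up, the helper `summarize_loadings` is hypothetical, and the `_PC_` value for MEAN and STD observations is not stated in this section, so it is left as `None` here.

```python
# Illustrative LOADINGS= layout: three process variables (A, B, C)
# and two principal components.
loadings_rows = [
    {"_VALUE_": "EIGEN",   "_NOBS_": 50, "_PC_": 0,    "A": 1.8,  "B": 0.9, "C": 0.3},
    {"_VALUE_": "LOADING", "_NOBS_": 50, "_PC_": 1,    "A": 0.6,  "B": 0.8, "C": 0.0},
    {"_VALUE_": "LOADING", "_NOBS_": 50, "_PC_": 2,    "A": -0.8, "B": 0.6, "C": 0.0},
    {"_VALUE_": "MEAN",    "_NOBS_": 50, "_PC_": None, "A": 10.0, "B": 5.0, "C": 2.0},
    {"_VALUE_": "STD",     "_NOBS_": 50, "_PC_": None, "A": 2.0,  "B": 1.5, "C": 0.5},
]


def summarize_loadings(rows):
    """Report the component count and whether data were centered or scaled."""
    kinds = [row["_VALUE_"] for row in rows]
    if kinds.count("EIGEN") != 1:
        raise ValueError("expected exactly one EIGEN observation")
    return {
        "n_components": kinds.count("LOADING"),  # j LOADING observations
        "centered": "MEAN" in kinds,             # MEAN row present
        "scaled": "STD" in kinds,                # STD row present
    }


summary = summarize_loadings(loadings_rows)
```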