The DISCRIM Procedure

Output Data Sets

Subsections:

OUT= Data Set
OUTD= Data Set
OUTCROSS= Data Set
TESTOUT= Data Set
TESTOUTD= Data Set
OUTSTAT= Data Set

When an output data set includes variables containing the posterior probabilities of group membership (OUT=, OUTCROSS=, or TESTOUT= data sets) or group-specific density estimates (OUTD= or TESTOUTD= data sets), the names of these variables are constructed from the formatted values of the class levels converted to valid SAS variable names.

OUT= Data Set

The OUT= data set contains all the variables in the DATA= data set, plus new variables containing the posterior probabilities and the resubstitution classification results. The names of the new variables containing the posterior probabilities are constructed from the formatted values of the class levels converted to SAS names. A new variable, _INTO_, with the same attributes as the CLASS variable, specifies the class to which each observation is assigned. If an observation is labeled as Other, the variable _INTO_ has a missing value. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. The NCAN= option determines the number of canonical variables. The names of the canonical variables are constructed as described in the CANPREFIX= option. The canonical variables have means equal to zero and pooled within-class variances equal to one.

An OUT= data set cannot be created if the DATA= data set is not an ordinary SAS data set.
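
As a minimal sketch of how an OUT= data set might be created and inspected, the following statements classify the Sashelp.Iris sample data by resubstitution. The input data set, the Work.Iris_Out name, and the choice of METHOD=NORMAL with POOL=YES are assumptions made for illustration only.

```sas
/* Assumed example input: Sashelp.Iris with CLASS variable Species. */
proc discrim data=sashelp.iris out=work.iris_out
             method=normal pool=yes canonical ncan=2 noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* The listing should show the original variables, one posterior   */
/* probability variable per class level (named from the formatted  */
/* Species values), the _INTO_ variable, and two canonical score   */
/* variables (Can1 and Can2 under the default CANPREFIX=).          */
proc print data=work.iris_out(obs=5);
run;
```

Because _INTO_ has the same attributes as Species, the two variables can be compared directly to find misclassified observations.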

OUTD= Data Set

The OUTD= data set contains all the variables in the DATA= data set, plus new variables containing the group-specific density estimates. The names of the new variables containing the density estimates are constructed from the formatted values of the class levels.

An OUTD= data set cannot be created if the DATA= data set is not an ordinary SAS data set.
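
The following statements sketch one way to request an OUTD= data set. The Sashelp.Iris input, the two-variable VAR list, and the Work.Iris_Dens name are illustrative assumptions.

```sas
/* Assumed example input: Sashelp.Iris with CLASS variable Species. */
proc discrim data=sashelp.iris outd=work.iris_dens
             method=normal pool=yes noprint;
   class Species;
   var PetalLength PetalWidth;
run;

/* Each observation should carry one density estimate per class,   */
/* in variables named from the formatted Species values.            */
proc print data=work.iris_dens(obs=5);
run;
```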

OUTCROSS= Data Set

The OUTCROSS= data set contains all the variables in the DATA= data set, plus new variables containing the posterior probabilities and the classification results of cross validation. The names of the new variables containing the posterior probabilities are constructed from the formatted values of the class levels. A new variable, _INTO_, with the same attributes as the CLASS variable, specifies the class to which each observation is assigned. When an observation is labeled as Other, the variable _INTO_ has a missing value. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. The NCAN= option determines the number of canonical variables. The names of the canonical variables are constructed as described in the CANPREFIX= option. The canonical variables have means equal to zero and pooled within-class variances equal to one.

An OUTCROSS= data set cannot be created if the DATA= data set is not an ordinary SAS data set.
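
As a hedged illustration, the following statements request cross validation and then tabulate the actual class against the class assigned in the OUTCROSS= data set. The Sashelp.Iris input and the Work.Iris_CV name are assumptions made for this sketch.

```sas
/* Assumed example input: Sashelp.Iris with CLASS variable Species. */
proc discrim data=sashelp.iris outcross=work.iris_cv
             method=normal pool=yes crossvalidate noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Cross-tabulate the true class against the cross validated class. */
/* The MISSING option keeps observations labeled as Other, for which */
/* _INTO_ is missing, in the table.                                  */
proc freq data=work.iris_cv;
   tables Species*_INTO_ / missing norow nocol nopercent;
run;
```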

TESTOUT= Data Set

The TESTOUT= data set contains all the variables in the TESTDATA= data set, plus new variables containing the posterior probabilities and the classification results. The names of the new variables containing the posterior probabilities are formed from the formatted values of the class levels. A new variable, _INTO_, with the same attributes as the CLASS variable, gives the class to which each observation is assigned. If an observation is labeled as Other, the variable _INTO_ has a missing value. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. The NCAN= option determines the number of canonical variables. The names of the canonical variables are formed as described in the CANPREFIX= option.

TESTOUTD= Data Set

The TESTOUTD= data set contains all the variables in the TESTDATA= data set, plus new variables containing the group-specific density estimates. The names of the new variables containing the density estimates are formed from the formatted values of the class levels.
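
The following statements sketch how the TESTOUT= and TESTOUTD= data sets might be produced together. The split of Sashelp.Iris into Work.Train and Work.Test (every fifth observation held out) and all Work.* names are assumptions chosen only to make the example self-contained.

```sas
/* Assumed split: hold out every fifth observation as test data. */
data work.train work.test;
   set sashelp.iris;
   if mod(_n_, 5) = 0 then output work.test;
   else output work.train;
run;

/* Derive the criterion from Work.Train and apply it to Work.Test. */
proc discrim data=work.train testdata=work.test
             testout=work.test_post testoutd=work.test_dens
             method=normal pool=yes noprint;
   class Species;
   testclass Species;   /* true class in the test data, if known */
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Work.Test_Post: posterior probabilities and _INTO_.            */
/* Work.Test_Dens: group-specific density estimates.              */
proc print data=work.test_post(obs=5);
run;
proc print data=work.test_dens(obs=5);
run;
```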

OUTSTAT= Data Set

The OUTSTAT= data set is similar to the TYPE=CORR data set produced by the CORR procedure. The data set contains various statistics such as means, standard deviations, and correlations. For an example of an OUTSTAT= data set, see Example 35.3. When you specify the CANONICAL option, canonical correlations, canonical structures, canonical coefficients, and means of canonical variables for each class are included in the data set.

If you specify METHOD=NORMAL, the output data set also includes coefficients of the discriminant functions, and the data set is TYPE=LINEAR (POOL=YES), TYPE=QUAD (POOL=NO), or TYPE=MIXED (POOL=TEST). If you specify METHOD=NPAR, this output data set is TYPE=CORR.
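
One way to check the data set type is to run PROC CONTENTS on the OUTSTAT= data set, as in the following sketch. The Sashelp.Iris input and the Work.Iris_Stat name are illustrative assumptions.

```sas
/* Assumed example input: Sashelp.Iris with CLASS variable Species. */
proc discrim data=sashelp.iris outstat=work.iris_stat
             method=normal pool=yes noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* With METHOD=NORMAL and POOL=YES, the data set type reported by */
/* PROC CONTENTS should be LINEAR.                                 */
proc contents data=work.iris_stat;
run;
```

Changing POOL=YES to POOL=NO in this sketch should change the reported type to QUAD.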

The OUTSTAT= data set contains the following variables:

the BY variables, if any
the CLASS variable
_TYPE_, a character variable of length 8 that identifies the type of statistic
_NAME_, a character variable of length 32 that identifies the row of the matrix, the name of the canonical variable, or the type of the discriminant function coefficients
the quantitative variables—that is, those in the VAR statement, or, if there is no VAR statement, all numeric variables not listed in any other statement

Each observation is identified by the value of the _TYPE_ variable, which can take the following values:

_TYPE_: Contents
N: number of observations both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
SUMWGT: sum of weights both for the total sample (CLASS variable missing) and within each class (CLASS variable present), if a WEIGHT statement is specified
MEAN: means both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PRIOR: prior probability for each class
STDMEAN: total-standardized class means
PSTDMEAN: pooled within-class standardized class means
STD: standard deviations both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSTD: pooled within-class standard deviations
BSTD: between-class standard deviations
RSQUARED: univariate R squares
LNDETERM: the natural log of the determinant or the natural log of the quasi determinant of the within-class covariance matrix either pooled (CLASS variable missing) or not pooled (CLASS variable present)

The following kinds of observations are identified by the combination of the variables _TYPE_ and _NAME_. When the _TYPE_ variable has one of the following values, the _NAME_ variable identifies the row of the matrix:

_TYPE_: Contents
CSSCP: corrected SSCP matrix both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PSSCP: pooled within-class corrected SSCP matrix
BSSCP: between-class SSCP matrix
COV: covariance matrix both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCOV: pooled within-class covariance matrix
BCOV: between-class covariance matrix
CORR: correlation matrix both for the total sample (CLASS variable missing) and within each class (CLASS variable present)
PCORR: pooled within-class correlation matrix
BCORR: between-class correlation matrix
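
As an illustration of how _TYPE_, _NAME_, and the CLASS variable combine to identify matrix rows, the following sketch extracts the pooled within-class covariance matrix and the total-sample covariance matrix. The Sashelp.Iris input and the Work.Iris_Stat name are assumptions.

```sas
/* Assumed example input: Sashelp.Iris with CLASS variable Species. */
proc discrim data=sashelp.iris outstat=work.iris_stat
             method=normal pool=yes noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Pooled within-class covariance matrix: one row per variable,   */
/* with _NAME_ identifying the row.                                */
proc print data=work.iris_stat noobs;
   where _TYPE_ = 'PCOV';
run;

/* Total-sample covariance matrix: _TYPE_ is COV and the CLASS    */
/* variable is missing.                                            */
proc print data=work.iris_stat noobs;
   where _TYPE_ = 'COV' and missing(Species);
run;
```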

When you request canonical discriminant analysis, the _NAME_ variable identifies a canonical variable, and the _TYPE_ variable can have one of the following values:

_TYPE_: Contents
CANCORR: canonical correlations
STRUCTUR: canonical structure
BSTRUCT: between canonical structure
PSTRUCT: pooled within-class canonical structure
SCORE: standardized canonical coefficients
RAWSCORE: raw canonical coefficients
CANMEAN: means of the canonical variables for each class
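
The following sketch shows one way to list a few of these canonical observations. The Sashelp.Iris input, the Work.Iris_Can name, and the selection of _TYPE_ values are assumptions made for illustration.

```sas
/* Assumed example input: Sashelp.Iris with CLASS variable Species. */
proc discrim data=sashelp.iris outstat=work.iris_can
             canonical ncan=2 noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Canonical correlations, raw canonical coefficients, and class  */
/* means of the canonical variables.                               */
proc print data=work.iris_can noobs;
   where _TYPE_ in ('CANCORR', 'RAWSCORE', 'CANMEAN');
run;
```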

When you specify METHOD=NORMAL, the _NAME_ variable identifies different types of coefficients in the discriminant function, and the _TYPE_ variable can have one of the following values:

_TYPE_: Contents
LINEAR: coefficients of the linear discriminant functions
QUAD: coefficients of the quadratic discriminant functions

The values of the _NAME_ variable are as follows:

_NAME_: Contents
variable names: quadratic coefficients of the quadratic discriminant functions (a symmetric matrix for each class)
_LINEAR_: linear coefficients of the discriminant functions
_CONST_: constant coefficients of the discriminant functions
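
As a hedged example, the following statements fit a quadratic discriminant function and list its coefficients, so that all three kinds of _NAME_ values appear. The Sashelp.Iris input, the two-variable VAR list, and the Work.Iris_Quad name are assumptions.

```sas
/* Assumed example input: Sashelp.Iris with CLASS variable Species. */
/* POOL=NO requests quadratic discriminant functions.               */
proc discrim data=sashelp.iris outstat=work.iris_quad
             method=normal pool=no noprint;
   class Species;
   var PetalLength PetalWidth;
run;

/* For each class, expect one row per variable name (the symmetric */
/* matrix of quadratic coefficients), one _LINEAR_ row, and one     */
/* _CONST_ row.                                                     */
proc print data=work.iris_quad noobs;
   where _TYPE_ = 'QUAD';
run;
```

With POOL=YES instead, the corresponding rows would have _TYPE_ equal to LINEAR and only the _LINEAR_ and _CONST_ values of _NAME_.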