PROC DISCRIM Statement 
The PROC DISCRIM statement invokes the DISCRIM procedure. The options listed in Table 32.1 are available in the PROC DISCRIM statement.
Option          Description

Input Data Sets
DATA=           Specifies input SAS data set
TESTDATA=       Specifies input SAS data set to classify

Output Data Sets
OUTSTAT=        Specifies output statistics data set
OUT=            Specifies output data set with classification results
OUTCROSS=       Specifies output data set with cross validation results
OUTD=           Specifies output data set with densities
SCORES=         Outputs discriminant scores to the OUT= data set
TESTOUT=        Specifies output data set with TESTDATA= results
TESTOUTD=       Specifies output data set with TESTDATA= densities

Method Details
METHOD=         Specifies parametric or nonparametric method
POOL=           Specifies whether to pool the covariance matrices
SINGULAR=       Specifies the singularity criterion
SLPOOL=         Specifies significance level for the homogeneity test
THRESHOLD=      Specifies the minimum threshold for classification

Nonparametric Methods
K=              Specifies k value for k nearest neighbors
KPROP=          Specifies proportion, p, for computing k
R=              Specifies radius for kernel density estimation
KERNEL=         Specifies a kernel density to estimate
METRIC=         Specifies metric for squared distances

Canonical Discriminant Analysis
CANONICAL       Performs canonical discriminant analysis
CANPREFIX=      Specifies a prefix for naming the canonical variables
NCAN=           Specifies the number of canonical variables

Resubstitution Classification
LIST            Displays the classification results
LISTERR         Displays the misclassified observations
NOCLASSIFY      Suppresses the classification
TESTLIST        Displays the classification results of TESTDATA=
TESTLISTERR     Displays the misclassified observations of TESTDATA=

Cross Validation Classification
CROSSLIST       Displays the cross validation results
CROSSLISTERR    Displays the misclassified cross validation results
CROSSVALIDATE   Specifies cross validation

Control Displayed Output
ALL             Displays all output
ANOVA           Displays univariate statistics
BCORR           Displays between correlations
BCOV            Displays between covariances
BSSCP           Displays between SSCPs
DISTANCE        Displays squared Mahalanobis distances
MANOVA          Displays multivariate ANOVA results
NOPRINT         Suppresses all displayed output
PCORR           Displays pooled correlations
PCOV            Displays pooled covariances
POSTERR         Displays posterior probability error-rate estimates
PSSCP           Displays pooled SSCPs
SHORT           Suppresses some displayed output
SIMPLE          Displays simple descriptive statistics
STDMEAN         Displays standardized class means
TCORR           Displays total correlations
TCOV            Displays total covariances
TSSCP           Displays total SSCPs
WCORR           Displays within correlations
WCOV            Displays within covariances
WSSCP           Displays within SSCPs

ALL
activates all options that control displayed output. When the derived classification criterion is used to classify observations, the ALL option also activates the POSTERR option.
ANOVA
displays univariate statistics for testing the hypothesis that the class means are equal in the population for each variable.
BCOV
displays between-class covariances. The between-class covariance matrix equals the between-class SSCP matrix divided by $n(c-1)/c$, where n is the number of observations and c is the number of classes. You should interpret the between-class covariances in comparison with the total-sample and within-class covariances, not as formal estimates of population parameters.
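As a quick arithmetic check of the divisor $n(c-1)/c$, consider illustrative values that are not taken from this documentation: n = 150 observations in c = 3 classes.

```latex
\frac{n(c-1)}{c} = \frac{150 \cdot (3-1)}{3} = \frac{300}{3} = 100
```

Under these assumed values, each element of the between-class SSCP matrix would be divided by 100 to obtain the between-class covariance matrix.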
CANPREFIX=name
specifies a prefix for naming the canonical variables. By default, the names are Can1, Can2, ..., Cann. If you specify CANPREFIX=ABC, the components are named ABC1, ABC2, ABC3, and so on. The number of characters in the prefix, plus the number of digits required to designate the canonical variables, should not exceed 32. The prefix is truncated if the combined length exceeds 32.
The CANONICAL option is activated when you specify either the NCAN= or the CANPREFIX= option. A discriminant criterion is always derived in PROC DISCRIM. If you want canonical discriminant analysis without the use of a discriminant criterion, you should use PROC CANDISC.
CROSSLIST
displays the cross validation classification results for each observation.
CROSSLISTERR
displays the cross validation classification results for misclassified observations only.
CROSSVALIDATE
specifies the cross validation classification of the input DATA= data set. When a parametric method is used, PROC DISCRIM classifies each observation in the DATA= data set by using a discriminant function computed from the other observations in the DATA= data set, excluding the observation being classified. When a nonparametric method is used, the covariance matrices used to compute the distances are based on all observations in the data set and do not exclude the observation being classified. However, the observation being classified is excluded from the nonparametric density estimation (if you specify the R= option) or the k nearest neighbors (if you specify the K= or KPROP= option) of that observation.
The CROSSVALIDATE option is set when you specify the CROSSLIST, CROSSLISTERR, or OUTCROSS= option. With these options, cross validation information is displayed or output in addition to the usual resubstitution classification results. Cross validation classification results are written to the OUTCROSS= data set, and resubstitution classification results are written to the OUT= data set.
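The following is a minimal sketch of requesting cross validation alongside resubstitution. It assumes that the SASHELP.IRIS sample data set is available; the output data set names Resub and CrossVal are hypothetical.

```sas
/* Sketch only: assumes SASHELP.IRIS exists in your installation. */
proc discrim data=sashelp.iris
             method=normal pool=yes
             crosslisterr          /* specifying this sets CROSSVALIDATE */
             out=Resub             /* resubstitution classification results */
             outcross=CrossVal;    /* cross validation classification results */
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Comparing the classification variables in the two output data sets gives a rough sense of how optimistic the resubstitution results are.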
DATA=SAS-data-set
specifies the data set to be analyzed. The data set can be an ordinary SAS data set or one of several specially structured data sets created by SAS/STAT procedures. These specially structured data sets include TYPE=CORR, TYPE=COV, TYPE=CSSCP, TYPE=SSCP, TYPE=LINEAR, TYPE=QUAD, and TYPE=MIXED. The input data set must be an ordinary SAS data set if you specify METHOD=NPAR. If you omit the DATA= option, the procedure uses the most recently created SAS data set.
DISTANCE
displays the squared Mahalanobis distances between the group means, F statistics, and the corresponding probabilities of greater squared Mahalanobis distances between the group means. The squared distances are based on the specification of the POOL= and METRIC= options.
K=k
specifies a k value for the k-nearest-neighbor rule. An observation x is classified into a group based on the information from the k nearest neighbors of x. Do not specify the K= option with the KPROP= or R= option.
KPROP=p
specifies a proportion, p, for computing the k value for the k-nearest-neighbor rule: $k = \max(1, \lfloor np \rfloor)$, where n is the number of valid observations. When there is a FREQ statement, n is the sum of the FREQ variable for the observations used in the analysis (those without missing or invalid values). An observation x is classified into a group based on the information from the k nearest neighbors of x. Do not specify the KPROP= option with the K= or R= option.
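A short worked example of the rule $k = \max(1, \lfloor np \rfloor)$, using illustrative values that are not taken from this documentation (n = 150 valid observations):

```latex
% KPROP=0.05: np = 7.5, so the floor is 7
k = \max\bigl(1, \lfloor 150 \times 0.05 \rfloor\bigr) = \max(1, 7) = 7

% KPROP=0.005: np = 0.75, so the floor is 0 and the lower bound of 1 applies
k = \max\bigl(1, \lfloor 150 \times 0.005 \rfloor\bigr) = \max(1, 0) = 1
```

The second case shows why the maximum with 1 matters: a very small proportion still yields at least one neighbor.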
KERNEL=
specifies a kernel density to estimate the group-specific densities. You can specify the KERNEL= option only when the R= option is specified. The default is KERNEL=UNIFORM.
LIST
displays the resubstitution classification results for each observation. You can specify this option only when the input data set is an ordinary SAS data set.
LISTERR
displays the resubstitution classification results for misclassified observations only. You can specify this option only when the input data set is an ordinary SAS data set.
MANOVA
displays multivariate statistics for testing the hypothesis that the class means are equal in the population.
METHOD=NORMAL | NPAR
determines the method to use in deriving the classification criterion. When you specify METHOD=NORMAL, a parametric method based on a multivariate normal distribution within each class is used to derive a linear or quadratic discriminant function. The default is METHOD=NORMAL. When you specify METHOD=NPAR, a nonparametric method is used and you must also specify either the K= or R= option.
METRIC=DIAGONAL | FULL | IDENTITY
specifies the metric in which the computations of squared distances are performed. If you specify METRIC=FULL, then PROC DISCRIM uses either the pooled covariance matrix (POOL=YES) or individual within-group covariance matrices (POOL=NO) to compute the squared distances. If you specify METRIC=DIAGONAL, then PROC DISCRIM uses either the diagonal matrix of the pooled covariance matrix (POOL=YES) or diagonal matrices of individual within-group covariance matrices (POOL=NO) to compute the squared distances. If you specify METRIC=IDENTITY, then PROC DISCRIM uses Euclidean distance. The default is METRIC=FULL. When you specify METHOD=NORMAL, the option METRIC=FULL is used.
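As an illustration of how METHOD=, K=, and METRIC= combine, the following sketch requests a k-nearest-neighbor rule that uses Euclidean distance. It assumes the SASHELP.IRIS sample data set is available; the value K=5 is an arbitrary choice for illustration.

```sas
/* Sketch only: 5-nearest-neighbor classification with Euclidean distance. */
/* METHOD=NPAR requires either K= (or KPROP=) or R=.                       */
proc discrim data=sashelp.iris
             method=npar k=5
             metric=identity       /* Euclidean distance */
             listerr;              /* list misclassified observations only */
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

With METRIC=IDENTITY the variables are not rescaled, so variables measured on larger scales dominate the distance; METRIC=DIAGONAL or the default METRIC=FULL might be preferable when scales differ.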
NCAN=number
specifies the number of canonical variables to compute. The value of number must be less than or equal to the number of variables. If you specify the option NCAN=0, the procedure displays the canonical correlations but not the canonical coefficients, structures, or means. Let v be the number of variables in the VAR statement, and let c be the number of classes. If you omit the NCAN= option, only $\min(v, c-1)$ canonical variables are generated. If you request an output data set (OUT=, OUTCROSS=, TESTOUT=), v canonical variables are generated. In this case, the last $v-(c-1)$ canonical variables have missing values.
The CANONICAL option is activated when you specify either the NCAN= or the CANPREFIX= option. A discriminant criterion is always derived in PROC DISCRIM. If you want canonical discriminant analysis without the use of a discriminant criterion, you should use PROC CANDISC.
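The following sketch shows one way the canonical options might be combined. It assumes the SASHELP.IRIS sample data set is available (four variables, three classes); the prefix Disc and the data set name CanScores are hypothetical.

```sas
/* Sketch only: specifying NCAN= or CANPREFIX= activates CANONICAL, */
/* so the CANONICAL keyword here is redundant but explicit.         */
proc discrim data=sashelp.iris
             canonical ncan=2 canprefix=Disc
             out=CanScores;        /* also receives canonical variable scores */
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

With three classes, at most two canonical variables carry information, which is why NCAN=2 is used in this sketch.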
NOCLASSIFY
suppresses the resubstitution classification of the input DATA= data set. You can specify this option only when the input data set is an ordinary SAS data set.
NOPRINT
suppresses the normal display of results. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 20, Using the Output Delivery System, for more information.
OUT=SAS-data-set
creates an output SAS data set containing all the data from the DATA= data set, plus the posterior probabilities and the class into which each observation is classified by resubstitution. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. See the section OUT= Data Set for more information.
OUTCROSS=SAS-data-set
creates an output SAS data set containing all the data from the DATA= data set, plus the posterior probabilities and the class into which each observation is classified by cross validation. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. See the section OUT= Data Set for more information.
OUTD=SAS-data-set
creates an output SAS data set containing all the data from the DATA= data set, plus the group-specific density estimates for each observation. See the section OUT= Data Set for more information.
OUTSTAT=SAS-data-set
creates an output SAS data set containing various statistics such as means, standard deviations, and correlations. When the input data set is an ordinary SAS data set or when TYPE=CORR, TYPE=COV, TYPE=CSSCP, or TYPE=SSCP, this option can be used to generate discriminant statistics. When you specify the CANONICAL option, canonical correlations, canonical structures, canonical coefficients, and means of canonical variables for each class are included in the data set. If you specify METHOD=NORMAL, the output data set also includes coefficients of the discriminant functions, and the output data set is TYPE=LINEAR (POOL=YES), TYPE=QUAD (POOL=NO), or TYPE=MIXED (POOL=TEST). If you specify METHOD=NPAR, this output data set is TYPE=CORR. This data set also holds calibration information that can be used to classify new observations. See the sections Saving and Using Calibration Information and OUT= Data Set for more information.
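A minimal sketch of saving calibration information and reusing it later is shown here. It assumes the SASHELP.IRIS sample data set is available; the split into Train and Test and the data set names Calib and Scored are hypothetical choices for illustration.

```sas
/* Sketch only: hold out every fifth observation as new data to classify. */
data Train Test;
   set sashelp.iris;
   if mod(_n_, 5) = 0 then output Test;
   else output Train;
run;

/* Step 1: derive the criterion and save it (TYPE=LINEAR because POOL=YES). */
proc discrim data=Train method=normal pool=yes outstat=Calib;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Step 2: read the calibration data set and classify the new observations. */
proc discrim data=Calib testdata=Test testout=Scored;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

The second step does not need the original training observations, only the statistics stored in the calibration data set.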
POOL=NO | TEST | YES
determines whether the pooled or within-group covariance matrix is the basis of the measure of the squared distance. If you specify POOL=YES, then PROC DISCRIM uses the pooled covariance matrix in calculating the (generalized) squared distances. Linear discriminant functions are computed. If you specify POOL=NO, the procedure uses the individual within-group covariance matrices in calculating the distances. Quadratic discriminant functions are computed. The default is POOL=YES. The k-nearest-neighbor method assumes the default of POOL=YES, and the POOL=TEST option cannot be used with the METHOD=NPAR option.
When you specify METHOD=NORMAL, the option POOL=TEST requests Bartlett’s modification of the likelihood ratio test (Morrison 1976; Anderson 1984) of the homogeneity of the within-group covariance matrices. The test is unbiased (Perlman 1980). However, it is not robust to nonnormality. If the test statistic is significant at the level specified by the SLPOOL= option, the within-group covariance matrices are used. Otherwise, the pooled covariance matrix is used. The discriminant function coefficients are displayed only when the pooled covariance matrix is used.
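The following sketch lets the homogeneity test choose between a linear and a quadratic rule. It assumes the SASHELP.IRIS sample data set is available; the significance level 0.05 is an illustrative choice that overrides the default of 0.10.

```sas
/* Sketch only: POOL=TEST is valid only with METHOD=NORMAL. */
proc discrim data=sashelp.iris
             method=normal
             pool=test slpool=0.05;   /* significant test: within-group matrices */
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Because the test is not robust to nonnormality, a significant result might reflect departures from normality rather than genuinely unequal covariance matrices.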
POSTERR
displays the posterior probability error-rate estimates of the classification criterion based on the classification results.
R=r
specifies a radius r value for kernel density estimation. With uniform, Epanechnikov, biweight, or triweight kernels, an observation x is classified into a group based on the information from observations y in the training set within the radius r of x; that is, the group t observations y with squared distance $d_t^2(x, y) \le r^2$. When a normal kernel is used, the classification of an observation x is based on the information of the estimated group-specific densities from all observations in the training set. The matrix $r^2 V_t$ is used as the group t covariance matrix in the normal-kernel density, where $V_t$ is the matrix used in calculating the squared distances. Do not specify the K= or KPROP= option with the R= option. For more information about selecting r, see the section Nonparametric Methods.
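As a sketch of kernel density classification, the following step combines the R= and KERNEL= options. It assumes the SASHELP.IRIS sample data set is available; the radius 0.5 is an arbitrary illustrative value, not a recommendation.

```sas
/* Sketch only: normal-kernel density estimation with radius r = 0.5. */
/* KERNEL= can be specified only together with R=.                    */
proc discrim data=sashelp.iris
             method=npar r=0.5 kernel=normal
             pool=yes                 /* distances use the pooled covariance matrix */
             crossvalidate;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

With a normal kernel, every training observation contributes to each density estimate, so no observation is left without neighbors, which can happen with the bounded kernels when r is small.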
SCORES | SCORES=prefix
computes and outputs discriminant scores to the OUT= and TESTOUT= data sets with the default options METHOD=NORMAL and POOL=YES (or with METHOD=NORMAL, POOL=TEST, and a nonsignificant chi-square test). Otherwise, or if no OUT= or TESTOUT= data set is specified, this option is ignored. The scores are computed by a matrix multiplication of an intercept term and the raw data or test data by the coefficients in the linear discriminant function. One score variable is created for each level of the CLASS variable. By default, the variables are named "Sc_" followed by the formatted class level. You can specify SCORES=prefix to use a prefix other than "Sc_". The specifications SCORES and SCORES=Sc_ are equivalent.
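A minimal sketch of writing discriminant scores to an output data set follows. It assumes the SASHELP.IRIS sample data set is available; the prefix Score_ and the data set name Scored are hypothetical.

```sas
/* Sketch only: SCORES= takes effect because METHOD=NORMAL and POOL=YES */
/* are in force and an OUT= data set is requested.                      */
proc discrim data=sashelp.iris
             method=normal pool=yes
             scores=Score_
             out=Scored;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Inspect the first few observations, including the score variables. */
proc print data=Scored(obs=5);
run;
```

One score variable per class level would be expected in the Scored data set, each named with the Score_ prefix followed by the formatted class level.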
SHORT
suppresses the display of certain items in the default output. If you specify METHOD=NORMAL, then PROC DISCRIM suppresses the display of determinants, generalized squared distances between class means, and discriminant function coefficients. When you specify the CANONICAL option, PROC DISCRIM suppresses the display of canonical structures, canonical coefficients, and class means on canonical variables; only tables of canonical correlations are displayed.
SIMPLE
displays simple descriptive statistics for the total sample and within each class.
SINGULAR=p
specifies the criterion for determining the singularity of a matrix, where $0 < p < 1$. The default is SINGULAR=1E-8.
Let $S$ be the total-sample correlation matrix. If the R square for predicting a quantitative variable in the VAR statement from the variables preceding it exceeds $1-p$, then $S$ is considered singular. If $S$ is singular, the probability levels for the multivariate test statistics and canonical correlations are adjusted for the number of variables with R square exceeding $1-p$.
Let $S_t$ be the group t covariance matrix, and let $S_p$ be the pooled covariance matrix. In group t, if the R square for predicting a quantitative variable in the VAR statement from the variables preceding it exceeds $1-p$, then $S_t$ is considered singular. Similarly, if the partial R square for predicting a quantitative variable in the VAR statement from the variables preceding it, after controlling for the effect of the CLASS variable, exceeds $1-p$, then $S_p$ is considered singular.
If PROC DISCRIM needs to compute either the inverse or the determinant of a matrix that is considered singular, then it uses a quasi-inverse or a quasi-determinant. For details, see the section Quasi-inverse.
SLPOOL=p
specifies the significance level for the test of homogeneity. You can specify the SLPOOL= option only when POOL=TEST is also specified. If you specify POOL=TEST but omit the SLPOOL= option, PROC DISCRIM uses 0.10 as the significance level for the test.
STDMEAN
displays total-sample and pooled within-class standardized class means.
TESTDATA=SAS-data-set
names an ordinary SAS data set with observations that are to be classified. The quantitative variable names in this data set must match those in the DATA= data set. When you specify the TESTDATA= option, you can also specify the TESTCLASS, TESTFREQ, and TESTID statements, and you can use the TESTOUT= and TESTOUTD= options to generate classification results and group-specific density estimates for observations in the test data set. Note that if the CLASS variable is not present in the TESTDATA= data set, the output will not include misclassification statistics.
TESTLIST
lists classification results for all observations in the TESTDATA= data set.
TESTLISTERR
lists only misclassified observations in the TESTDATA= data set, and only if a TESTCLASS statement is also used.
TESTOUT=SAS-data-set
creates an output SAS data set containing all the data from the TESTDATA= data set, plus the posterior probabilities and the class into which each observation is classified. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. See the section OUT= Data Set for more information.
TESTOUTD=SAS-data-set
creates an output SAS data set containing all the data from the TESTDATA= data set, plus the group-specific density estimates for each observation. See the section OUT= Data Set for more information.
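The test-data options can be combined in a single step, as in the following sketch. It assumes the SASHELP.IRIS sample data set is available; the holdout split and the data set names Train, Test, TestResults, and TestDens are hypothetical.

```sas
/* Sketch only: hold out every fifth observation as test data. */
data Train Test;
   set sashelp.iris;
   if mod(_n_, 5) = 0 then output Test;
   else output Train;
run;

proc discrim data=Train
             testdata=Test
             testlisterr            /* needs a TESTCLASS statement */
             testout=TestResults    /* posterior probabilities and assigned class */
             testoutd=TestDens;     /* group-specific density estimates */
   class Species;
   testclass Species;               /* known class in the test data */
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

The TESTCLASS statement is included here so that misclassified test observations can be identified; omit it when the true class of the test observations is unknown.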
THRESHOLD=p
specifies the minimum acceptable posterior probability for classification, where $0 \le p \le 1$. If the largest posterior probability of group membership is less than the THRESHOLD= value, the observation is labeled as ’Other’. The default is THRESHOLD=0. In some cases, you might want to specify a THRESHOLD= value slightly smaller than the desired p so that observations with posterior probabilities within rounding error of p are classified. For example, you can specify THRESHOLD=%sysevalf(0.5 - 1e-8) instead of THRESHOLD=0.5 so that observations with posterior probabilities within 1E-8 of 0.5 and larger are classified.
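A sketch of the rounding-tolerant threshold in context follows. It assumes the SASHELP.IRIS sample data set is available; the cutoff of 0.5 and the data set name Classified are illustrative.

```sas
/* Sketch only: observations whose largest posterior probability falls */
/* below the threshold are labeled 'Other' instead of being assigned.  */
proc discrim data=sashelp.iris
             method=normal pool=yes
             threshold=%sysevalf(0.5 - 1e-8)   /* tolerate rounding near 0.5 */
             out=Classified;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

The %SYSEVALF macro function evaluates the floating-point expression before the statement is parsed, so the procedure receives a single numeric value.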
WSSCP
displays the within-class corrected SSCP matrix for each class level.