This example uses the Iris data set that appears in examples in the PROC DISCRIM chapter and other procedure chapters of the SAS/STAT User's Guide. The Iris data set is available in the SASHELP library. Fifty specimens from each of three plant species are measured on several characteristics. In the following examples, the sepal length and width measurements are used to classify each specimen into one of the three species. Several competing approaches are considered: multinomial logistic models (a restricted model in PROC LOGISTIC and an unrestricted model in PROC GLIMMIX), discriminant analysis (both a parametric method that assumes normality of the measurements and a nonparametric, nearest-neighbor method), and a classification tree using PROC HPSPLIT. Additionally, a clustering method in PROC MBC is shown that ignores the species response.
The first model fit is a multinomial, generalized logit model. The EQUALSLOPES option restricts each predictor's slope to be equal across the two logits. Since the PREDPROBS=INDIVIDUAL option is used, the MultAUC macro can be called with no options.
proc logistic data=sashelp.iris;
model Species = SepalLength SepalWidth / link=glogit equalslopes;
output out=outlog predprobs=i;
run;
%MultAUC()
The multinomial AUC is estimated to be 0.7582. The pairwise AUCs range from chance level for the Setosa-Versicolor response pair (0.5) to near perfect for the Versicolor-Virginica pair (0.9948).
The next model is also a generalized logit model but is unrestricted, allowing for separate slopes on the two logits for both predictors. Since the output data set with predicted probabilities from GLIMMIX has multiple output observations for each input observation, it must be restructured for use in the MultAUC macro. To enable this, a variable (OBS) identifying the input observations is added to the data before analysis.
PROC TRANSPOSE restructures the output data to have a single observation for each input observation. A DATA step then merges in the variable containing the observed responses. The CATS function is used as a precaution to remove any leading blanks from the response (Species) levels. This data set can then be analyzed by the MultAUC macro after identifying the observed response variable.
Note that effectively the same unrestricted model can be fit in PROC LOGISTIC by simply removing the EQUALSLOPES option from the above example. However, GLIMMIX is used here to illustrate how its output data set can be modified for use with the MultAUC macro.
data iris2;
set sashelp.iris;
obs=_n_;
run;
proc glimmix data=iris2;
model Species = SepalLength SepalWidth / solution dist=mult link=glogit;
output out=outglim predicted(ilink);
run;
proc transpose data=outglim out=outglim2;
by obs; id _level_; var predmu;
run;
data outglim3;
merge iris2(keep=obs species) outglim2;
by obs;
Species=cats(Species);
run;
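As an optional check (not part of the original example), you can print a few of the restructured observations to verify the layout that the macro expects: one row per input observation containing the observed Species value and one predicted probability variable per response level.
proc print data=outglim3(obs=5);
run;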
%MultAUC(response=Species)
For this model, the overall and pairwise AUC values are all higher than those for the restricted model above. The overall AUC is 0.93.
Another classification method is discriminant analysis. The following statements perform a parametric analysis that assumes the predictors are normally distributed. The OUT= data set produced by PROC DISCRIM already has the required structure, so no modification is needed.
proc discrim data=sashelp.iris out=outdis;
class Species; var SepalLength SepalWidth;
run;
%MultAUC(response=Species)
The overall and pairwise AUC values for this analysis are similar to those found for the unrestricted generalized logit model above.
A nonparametric discriminant analysis is performed next using the k nearest neighbor method. For this analysis, k=9 nearest neighbors are used to develop the discriminant criterion.
proc discrim data=sashelp.iris method=npar k=9 out=outdis;
class Species; var SepalLength SepalWidth;
run;
%MultAUC(response=Species)
The overall AUC is again similar to the best results above. The AUC for the Versicolor-Virginica pair is stronger than in the previous analyses.
Next, a tree model is fit in PROC HPSPLIT using the default tree growing and pruning methods. Prior to analysis, the CATS function is used to remove any leading blanks that might exist in the response (Species) levels. Though that is not necessary in this case, it is recommended to avoid any problems that leading blanks can cause.
Note that the predicted probability variable names produced by the procedure prefix the response (Species) levels with "P_Species". To match these names, prefix=p_species is specified in the MultAUC call.
data iris; set sashelp.iris;
Species=cats(Species);
run;
proc hpsplit data=iris seed=48393;
class Species;
model Species = SepalLength SepalWidth;
output out=outspt;
run;
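If you want to confirm the exact predicted probability variable names before calling the macro, one option (not part of the original example) is to list the variables in the output data set with PROC CONTENTS:
proc contents data=outspt short;
run;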
%MultAUC(response=Species, prefix=p_species)
The overall AUC (0.90) is not quite as good as some of the models above.
The above modeling methods are known as supervised methods since the response is known and is used in developing the model. There are also unsupervised methods, such as clustering methods, that can be used to find groups in data when the true classifications of the observations are not known. Observations that are similar, based on some criterion, are grouped together in a cluster. SAS/STAT procedures that implement such clustering methods include the CLUSTER, FASTCLUS, and MODECLUS procedures. Model-based clustering is also available beginning in SAS® Viya® 3.4 and is used next.
Since this is an unsupervised method, the true classifications in the Species variable are not used. The method finds clusters of the observations based on their similarity using only the SepalLength and SepalWidth measurements. After creating a CAS table of the Iris data, the following statements initialize the method using k-means clustering and request a three-cluster solution. All possible covariance structures, as specified in the COVSTRUCT= option, are considered in choosing a final model. The output data set contains the cluster membership from the model for each observation (MAXPOST) and the posterior probabilities of membership in each cluster (named NEXT1, NEXT2, and NEXT3, where 1, 2, and 3 refer to the cluster number). The COPYVARS= option copies the Species, SepalLength, and SepalWidth variables from the input data set.
proc mbc data=casuser.iris nclusters=(3) init=kmeans seed=1418410433
covstruct=(EEE EEI EEV EII EVI EVV VII VVI VVV);
var SepalLength SepalWidth;
output out=casuser.scores maxpost copyvars=(Species SepalLength SepalWidth);
run;
The following statements produce a cross-classification of the cluster numbers assigned to the observations and the known Species levels.
proc freq data=casuser.scores;
table maxpost*Species;
run;
Notice that cluster 2 contains exactly the Setosa observations. While clusters 1 and 3 do not completely separate the other two species, cluster 1 contains mostly Versicolor observations and cluster 3 mostly Virginica observations.
These statements name the cluster posterior probability variables according to the Species with which they correspond as shown above, and then call the MultAUC macro.
data casuser.scores;
set casuser.scores;
Setosa = next2;
Virginica = next3;
Versicolor = next1;
run;
%MultAUC(data=casuser.scores, response=Species)
The resulting AUC (0.92) is only slightly below the best models above.
Among the models and methods considered above, the best performing models as judged by the AUC are the unrestricted generalized logit model and the nonparametric discriminant model with AUC values approximately equal to 0.93. Of course, all of these models and methods of estimating them could be altered in various ways, so it is possible that better models of each type could be found.
This example uses the wine data from the Getting Started section of the PROC HPSPLIT chapter in the SAS/STAT User's Guide. The data record a three-level variable, Cultivar, and 13 chemical attributes on 178 wine samples. The following statements create a random 60% training subset and a 40% test subset of the data. Computing the AUC on the data used to train a model can result in an overly optimistic value. By using a test data set for the computation, a more realistic AUC estimate can be obtained.
data Winetrn Winetst;
set Wine;
if ranuni(8473)>=.4 then output Winetrn;
else output Winetst;
run;
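An alternative way to create the split (not used in this example) is PROC SURVEYSELECT, which draws an exact 60% simple random sample; the OUTALL option keeps all observations and adds a Selected indicator that can drive the same two-way split. The seed below is arbitrary.
proc surveyselect data=Wine out=WineSplit samprate=0.6 seed=8473 outall;
run;
data Winetrn Winetst;
set WineSplit;
if Selected then output Winetrn;
else output Winetst;
drop Selected;
run;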
The LASSO selection method in PROC HPGENSELECT can be used to select the important predictors for the model. In the following, the LASSO method identifies a final model that contains Mg and Proline.
proc hpgenselect data=Winetrn;
model Cultivar = Alcohol Malic Ash Alkan Mg TotPhen Flav
NFPhen Cyanins Color Hue ODRatio Proline
/ dist=mult link=glogit;
selection method=lasso;
run;
The selected model is fit in PROC LOGISTIC, and the SCORE statement uses the fitted model to score the test data set. The OUTPUT statement scores the training data set.
proc logistic data=Winetrn;
model Cultivar = Mg Proline / link=glogit;
score data=Winetst out=Winetst;
output out=Predtrn predprobs=i;
run;
The MultAUC macro can be called to compute the AUC for each of the training and test data sets. The predicted probability variable names in the OUT= data set created by the SCORE statement use the prefix P_. That data set also contains the character variable F_Cultivar, which holds the observed response levels. Specifying prefix=P_ adds the prefix to the response levels to match the predicted probability variable names.
%MultAUC(data=Predtrn)
%MultAUC(data=Winetst, response=F_Cultivar, prefix=P_)
Using the LASSO-selected model developed above, the estimated AUC for the training data set is 0.90.
For the test data set, the selected model estimates the AUC as 0.89, which is slightly less optimistic than the value obtained above from the training data set.