Getting Started: DISCRIM Procedure
The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmavesi; the data set is available from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and width of each fish are recorded. Three different length measurements are taken: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The data set is also available as Sashelp.Fish in the Sashelp library. The goal is to find a discriminant function based on these six variables (weight, the three lengths, height, and width) that best classifies the fish into species.
First, assume that the data are normally distributed within each group with equal covariances across groups. The following statements use PROC DISCRIM to analyze the Sashelp.Fish data and create Figure 32.1 through Figure 32.5:
title 'Fish Measurement Data';
proc discrim data=sashelp.fish;
   class Species;
run;
The DISCRIM procedure begins by displaying summary information about the variables in the analysis (see Figure 32.1). This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement; because the VAR statement is omitted here, all numeric variables not named in other statements are used), and the number of classes in the classification variable (specified with the CLASS statement). The frequency of each class, its weight, its proportion of the total sample, and its prior probability are also displayed. Equal priors are assigned by default.
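As a sketch of how these defaults can be made explicit, the following statements name the six quantitative variables in a VAR statement and request equal priors in a PRIORS statement. The variable names are those of Sashelp.Fish; the step is intended to be equivalent to the one above rather than a different analysis. Replacing EQUAL with PROPORTIONAL would instead set the priors to the class proportions in the sample.

```sas
title 'Fish Measurement Data';
proc discrim data=sashelp.fish;
   class Species;                                  /* classification variable */
   var Weight Length1 Length2 Length3 Height Width; /* quantitative variables  */
   priors equal;                                   /* the default, stated explicitly */
run;
```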
Figure 32.1
Summary Information

Species      Frequency     Weight   Proportion   Prior Probability
Bream               34    34.0000     0.215190            0.142857
Parkki              11    11.0000     0.069620            0.142857
Perch               56    56.0000     0.354430            0.142857
Pike                17    17.0000     0.107595            0.142857
Roach               20    20.0000     0.126582            0.142857
Smelt               14    14.0000     0.088608            0.142857
Whitefish            6     6.0000     0.037975            0.142857
The natural log of the determinant of the pooled covariance matrix is displayed in Figure 32.2.
Figure 32.2
Pooled Covariance Matrix Information
The squared distances between the classes are shown in Figure 32.3.
Figure 32.3
Squared Distances
The DISCRIM Procedure

From Species      Bream      Parkki       Perch        Pike       Roach       Smelt   Whitefish
Bream                 0    83.32523   243.66688   310.52333   133.06721   252.75503   132.05820
Parkki         83.32523           0    57.09760   174.20918    27.00096    60.52076    26.54855
Perch         243.66688    57.09760           0   101.06791    29.21632    29.26806    20.43791
Pike          310.52333   174.20918   101.06791           0    92.40876   127.82177    99.90673
Roach         133.06721    27.00096    29.21632    92.40876           0    33.84280     6.31997
Smelt         252.75503    60.52076    29.26806   127.82177    33.84280           0    46.37326
Whitefish     132.05820    26.54855    20.43791    99.90673     6.31997    46.37326           0
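For reference, the squared distance between two classes under the pooled-covariance assumption is the Mahalanobis distance between their mean vectors. In the notation below (chosen here for illustration), \bar{x}_i is the mean vector of class i and S_p is the pooled within-class covariance matrix. With equal priors no prior term enters, which is consistent with the zero diagonal and the symmetry of the matrix above.

```latex
D^2(i \mid j) = (\bar{x}_i - \bar{x}_j)^{\top} S_p^{-1} (\bar{x}_i - \bar{x}_j)
```

The smallest off-diagonal entry, 6.31997 between Roach and Whitefish, marks the pair of species whose mean measurements are closest in this metric.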
The coefficients of the linear discriminant function are displayed (in Figure 32.4) with the default options METHOD=NORMAL and POOL=YES.
Figure 32.4
Linear Discriminant Function

Variable        Bream     Parkki      Perch       Pike      Roach      Smelt   Whitefish
Constant    185.91682   64.92517   48.68009  148.06402   62.65963   19.70401    67.44603
Weight        0.10912    0.09031    0.09418    0.13805    0.09901    0.05778     0.09948
Length1      23.02273   13.64180   19.45368   20.92442   14.63635    4.09257    22.57117
Length2      26.70692    5.38195   17.33061    6.19887    7.47195    3.63996     3.83450
Length3      50.55780   20.89531    5.25993   22.94989   25.00702   10.60171    21.12638
Height       13.91638    8.44567    1.42833    8.99687    0.26083    1.84569     0.64957
Width        23.71895   13.38592    1.32749    9.13410    3.74542    3.43630     2.52442

Note: The entries are shown without signs; the Constant row, for instance, is negative in the actual output.
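As a reminder of what these coefficients represent, the linear discriminant score for class j can be written as follows. The symbols are chosen here for illustration: \bar{x}_j is the class mean vector, S_p is the pooled covariance matrix, and q_j is the prior probability. An observation x is assigned to the class with the largest score.

```latex
\text{score}_j(x) = c_j + b_j^{\top} x, \qquad
b_j = S_p^{-1} \bar{x}_j, \qquad
c_j = -\tfrac{1}{2}\, \bar{x}_j^{\top} S_p^{-1} \bar{x}_j + \ln q_j
```

With equal priors, the term \ln q_j is the same for every class and does not affect which class attains the largest score.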
A summary of how the discriminant function classifies the data used to develop the function is displayed last. In Figure 32.5, you see that only three of the observations are misclassified. The error-count estimates give the proportion of misclassified observations in each group. Because you are classifying the same data that are used to derive the discriminant function, these error-count estimates are optimistically biased.
Figure 32.5
Resubstitution Misclassification Summary
Classification Summary for Calibration Data: SASHELP.FISH
Resubstitution Summary using Linear Discriminant Function

            Bream   Parkki    Perch     Pike    Roach    Smelt   Whitefish    Total
Rate       0.0000   0.0000   0.0536   0.0000   0.0000   0.0000      0.0000   0.0077
Priors     0.1429   0.1429   0.1429   0.1429   0.1429   0.1429      0.1429
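The figures above can be checked by hand. The only nonzero rate, 0.0536 for Perch, is consistent with three misclassified fish out of the 56 perch listed in Figure 32.1, and the total is the prior-weighted sum of the per-class rates (q_j denotes the prior and r_j the error rate of class j, notation chosen here for illustration):

```latex
r_{\text{Perch}} = \frac{3}{56} \approx 0.0536, \qquad
r_{\text{Total}} = \sum_{j=1}^{7} q_j \, r_j = \frac{1}{7} \times 0.0536 \approx 0.0077
```

Because the priors are equal rather than proportional, the total is not simply the overall fraction of misclassified observations.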

One way to reduce the bias of the error-count estimates is to split your data into two sets. One set is used to derive the discriminant function, and the other set is used to validate it. Example 32.4 shows how to analyze a test data set. Another way to reduce the bias is to classify each observation by using a discriminant function computed from all of the other observations; this method is invoked with the CROSSVALIDATE option.
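Both approaches can be sketched as follows. The first step requests cross validation on the full data. The second splits Sashelp.Fish at random with PROC SURVEYSELECT and passes the held-out part through the TESTDATA= option. The 70% sampling rate, the seed, and the data set names FishSplit, Train, and Test are arbitrary choices made for this illustration and are not taken from Example 32.4.

```sas
/* Leave-one-out classification of the calibration data */
proc discrim data=sashelp.fish crossvalidate;
   class Species;
run;

/* Random 70/30 split; OUTALL keeps every observation and
   adds a 0/1 indicator variable named Selected */
proc surveyselect data=sashelp.fish out=FishSplit
                  method=srs samprate=0.7 seed=12345 outall;
run;

data Train Test;
   set FishSplit;
   if Selected then output Train;
   else output Test;
run;

/* Derive the discriminant function on Train, classify Test */
proc discrim data=Train testdata=Test;
   class Species;
run;
```

With classes as small as Whitefish (six fish), a simple random split can leave very few observations of a species in one of the two sets, so a stratified split might be preferable in practice.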