# The DISCRIM Procedure

## Getting Started: DISCRIM Procedure

The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmavesi; this data set is available from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and width of each fish are recorded. Three different length measurements are taken: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The fish data set is also available from the `Sashelp` library. The goal is to find a discriminant function based on these six variables (weight, the three lengths, height, and width) that best classifies the fish into species.

First, assume that the data are normally distributed within each group with equal covariances across groups. The following statements use PROC DISCRIM to analyze the `Sashelp.Fish` data and create Figure 33.1 through Figure 33.5:

```
title 'Fish Measurement Data';

proc discrim data=sashelp.fish;
   class Species;
run;
```

The DISCRIM procedure begins by displaying summary information about the variables in the analysis (see Figure 33.1). This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement or, when the VAR statement is omitted as it is here, all numeric variables not listed in other statements), and the number of classes in the classification variable (specified with the CLASS statement). The frequency of each class, its weight, its proportion of the total sample, and its prior probability are also displayed. Equal priors are assigned by default.
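
Because the example omits the VAR statement, the analysis variables are chosen implicitly. As a sketch, the following statements spell out the same analysis with the variables and defaults written explicitly. The variable list is an assumption taken from the row labels of Figure 33.4, and METHOD=NORMAL and POOL=YES are the defaults named later in this section, so the output should match the shorter form.

```sas
/* Sketch: the same analysis with defaults written out.       */
/* The variable names are taken from the Figure 33.4 output.  */
proc discrim data=sashelp.fish method=normal pool=yes;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Listing the variables explicitly guards against an unintended numeric column entering the analysis if the input data set later gains one.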

Figure 33.1: Summary Information

 Fish Measurement Data

The DISCRIM Procedure

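| Statistic | Value |
|---|---:|
| Total Sample Size | 158 |
| Variables | 6 |
| Classes | 7 |
| DF Total | 157 |
| DF Within Classes | 151 |
| DF Between Classes | 6 |
| Number of Observations Read | 159 |
| Number of Observations Used | 158 |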

Class Level Information

| Species | Variable Name | Frequency | Weight | Proportion | Prior Probability |
|---|---|---:|---:|---:|---:|
| Bream | Bream | 34 | 34.0000 | 0.215190 | 0.142857 |
| Parkki | Parkki | 11 | 11.0000 | 0.069620 | 0.142857 |
| Perch | Perch | 56 | 56.0000 | 0.354430 | 0.142857 |
| Pike | Pike | 17 | 17.0000 | 0.107595 | 0.142857 |
| Roach | Roach | 20 | 20.0000 | 0.126582 | 0.142857 |
| Smelt | Smelt | 14 | 14.0000 | 0.088608 | 0.142857 |
| Whitefish | Whitefish | 6 | 6.0000 | 0.037975 | 0.142857 |
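
Equal priors give each of the seven species a prior probability of 1/7, or about 0.142857, regardless of how many fish of that species were caught. If the sample proportions are thought to reflect the population mix, the PRIORS statement can request them instead. The following is a minimal sketch under that assumption; the analysis in this section keeps the default equal priors.

```sas
/* Sketch: use the sample proportions as prior probabilities. */
/* Whether proportional priors suit these data is an          */
/* assumption made here only for illustration.                */
proc discrim data=sashelp.fish;
   class Species;
   priors proportional;
run;
```

With this statement, the prior for each species equals the value shown in the Proportion column, so the rarer species (for example, Whitefish at 0.037975) receive less weight in classification.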

The natural log of the determinant of the pooled covariance matrix is displayed in Figure 33.2.

Figure 33.2: Pooled Covariance Matrix Information

Pooled Covariance Matrix Information

| Covariance Matrix Rank | Natural Log of the Determinant of the Covariance Matrix |
|---:|---:|
| 6 | 4.17613 |

The squared distances between the classes are shown in Figure 33.3.

Figure 33.3: Squared Distances

Generalized Squared Distance to Species

| From Species | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish |
|---|---:|---:|---:|---:|---:|---:|---:|
| Bream | 0 | 83.32523 | 243.66688 | 310.52333 | 133.06721 | 252.75503 | 132.05820 |
| Parkki | 83.32523 | 0 | 57.09760 | 174.20918 | 27.00096 | 60.52076 | 26.54855 |
| Perch | 243.66688 | 57.09760 | 0 | 101.06791 | 29.21632 | 29.26806 | 20.43791 |
| Pike | 310.52333 | 174.20918 | 101.06791 | 0 | 92.40876 | 127.82177 | 99.90673 |
| Roach | 133.06721 | 27.00096 | 29.21632 | 92.40876 | 0 | 33.84280 | 6.31997 |
| Smelt | 252.75503 | 60.52076 | 29.26806 | 127.82177 | 33.84280 | 0 | 46.37326 |
| Whitefish | 132.05820 | 26.54855 | 20.43791 | 99.90673 | 6.31997 | 46.37326 | 0 |
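
With a pooled covariance matrix and equal priors, the generalized squared distance between two species reduces to the Mahalanobis distance between their mean vectors. The notation below is introduced here for illustration: $\bar{x}_i$ is the mean vector of species $i$ and $S_p$ is the pooled within-class covariance matrix.

$$
D^2(i \mid j) = (\bar{x}_i - \bar{x}_j)^\top S_p^{-1} (\bar{x}_i - \bar{x}_j)
$$

This form explains why the matrix in Figure 33.3 is symmetric with zeros on the diagonal. Roach and Whitefish are the closest pair (6.31997), and Bream and Pike are the most widely separated (310.52333). With unequal priors, a term that depends on the prior of the target class is added, and the matrix is in general no longer symmetric.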

With the default options METHOD=NORMAL and POOL=YES, PROC DISCRIM displays the coefficients of the linear discriminant function for each species (Figure 33.4).

Figure 33.4: Linear Discriminant Function

Linear Discriminant Function for Species

| Variable | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish |
|---|---:|---:|---:|---:|---:|---:|---:|
| Constant | -185.91682 | -64.92517 | -48.68009 | -148.06402 | -62.65963 | -19.70401 | -67.44603 |
| Weight | -0.10912 | -0.09031 | -0.09418 | -0.13805 | -0.09901 | -0.05778 | -0.09948 |
| Length1 | -23.02273 | -13.64180 | -19.45368 | -20.92442 | -14.63635 | -4.09257 | -22.57117 |
| Length2 | -26.70692 | -5.38195 | 17.33061 | 6.19887 | -7.47195 | -3.63996 | 3.83450 |
| Length3 | 50.55780 | 20.89531 | 5.25993 | 22.94989 | 25.00702 | 10.60171 | 21.12638 |
| Height | 13.91638 | 8.44567 | -1.42833 | -8.99687 | -0.26083 | -1.84569 | 0.64957 |
| Width | -23.71895 | -13.38592 | 1.32749 | -9.13410 | -3.74542 | -3.43630 | -2.52442 |
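
Each column of Figure 33.4 defines a linear score for one species, and an observation is assigned to the species whose score is largest. As a sketch in notation introduced here for illustration ($\bar{x}_j$ is the mean vector of species $j$, $S_p$ is the pooled covariance matrix, and $x$ is the vector of six measurements for one fish), the entries under equal priors are:

$$
c_j = -\tfrac{1}{2}\,\bar{x}_j^\top S_p^{-1}\,\bar{x}_j,
\qquad
b_j = S_p^{-1}\,\bar{x}_j,
\qquad
L_j(x) = c_j + b_j^\top x
$$

Here $c_j$ corresponds to the Constant row and $b_j$ to the six coefficient rows of column $j$. When the priors are unequal, the constant also includes the log of the prior probability for species $j$.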

A summary of how the discriminant function classifies the data used to develop the function is displayed last. In Figure 33.5, you see that only three of the observations are misclassified: three perch are classified as smelt. The error-count estimates give the proportion of misclassified observations in each group. Because you are classifying the same data that are used to derive the discriminant function, these error-count estimates are biased and tend to understate the true error rate.

Figure 33.5: Resubstitution Misclassification Summary

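Classification Summary for Calibration Data: SASHELP.FISH
Resubstitution Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species (each cell shows the count followed by the row percent in parentheses)

| From Species | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish | Total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Bream | 34 (100.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 34 (100.00) |
| Parkki | 0 (0.00) | 11 (100.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 11 (100.00) |
| Perch | 0 (0.00) | 0 (0.00) | 53 (94.64) | 0 (0.00) | 0 (0.00) | 3 (5.36) | 0 (0.00) | 56 (100.00) |
| Pike | 0 (0.00) | 0 (0.00) | 0 (0.00) | 17 (100.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 17 (100.00) |
| Roach | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 20 (100.00) | 0 (0.00) | 0 (0.00) | 20 (100.00) |
| Smelt | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 14 (100.00) | 0 (0.00) | 14 (100.00) |
| Whitefish | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 6 (100.00) | 6 (100.00) |
| Total | 34 (21.52) | 11 (6.96) | 53 (33.54) | 17 (10.76) | 20 (12.66) | 17 (10.76) | 6 (3.80) | 158 (100.00) |
| Priors | 0.14286 | 0.14286 | 0.14286 | 0.14286 | 0.14286 | 0.14286 | 0.14286 | |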

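Error Count Estimates for Species

| | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish | Total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Rate | 0.0000 | 0.0000 | 0.0536 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0077 |
| Priors | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | |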
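
The Total error rate is a prior-weighted average of the per-species rates rather than the raw fraction of misclassified fish. As a worked check using the values in the table (the symbols $q_j$ for the prior and $e_j$ for the error rate of species $j$ are introduced here for illustration):

$$
\text{Total} = \sum_{j=1}^{7} q_j\, e_j = \frac{1}{7}\cdot\frac{3}{56} \approx 0.0077
$$

Only Perch contributes, with $e_j = 3/56 \approx 0.0536$. The unweighted fraction would be $3/158 \approx 0.0190$; the two differ because equal priors give Perch a weight of 1/7 even though it makes up about 35% of the sample.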

One way to reduce the bias of the error-count estimates is to split your data into two sets. One set is used to derive the discriminant function, and the other set is used to run validation tests. Example 33.4 shows how to analyze a test data set. Another method of reducing bias is to classify each observation by using a discriminant function computed from all of the other observations; this method is invoked with the CROSSVALIDATE option.
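
Both approaches can be sketched with PROC DISCRIM options. In the sketch below, the data set names `train` and `test`, the 70/30 split, and the random seed are illustrative assumptions and are not part of the original example.

```sas
/* Leave-one-out cross validation on the full data set */
proc discrim data=sashelp.fish crossvalidate;
   class Species;
run;

/* Holdout validation: split the data at random, derive the   */
/* function from train, and classify test. The data set       */
/* names, the 70/30 ratio, and the seed are assumptions.      */
data train test;
   set sashelp.fish;
   if ranuni(12345) < 0.7 then output train;
   else output test;
run;

proc discrim data=train testdata=test;
   class Species;
run;
```

With only six whitefish in the data, a simple random split can leave very few of them in one of the two sets, so cross validation may be the more practical choice for the smaller classes.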