The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmaevesi; this data set is available
from Puranen (1917). For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and
width of each fish are tallied. Three different length measurements are recorded: from the nose of the fish to the beginning
of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded
as percentages of the third length variable. The fish data set is available from the Sashelp
library. The goal now is to find a discriminant function based on these six variables that best classifies the fish into
species.
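Before fitting a discriminant function, it can help to look at the measurements by species. The following is a minimal sketch, not part of the original example; it assumes the six analysis variables are named Weight, Length1, Length2, Length3, Height, and Width, as they appear in the PROC DISCRIM output later in this section.

```sas
/* Sketch: per-species counts, missing values, and basic statistics
   for the six measurement variables in Sashelp.Fish. */
proc means data=sashelp.fish n nmiss mean std min max;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

The N and NMISS columns show how many observations in each species have usable values, which is worth checking because PROC DISCRIM omits observations with missing analysis variables when it derives the discriminant function.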
First, assume that the data are normally distributed within each group with equal covariances across groups. The following
statements use PROC DISCRIM to analyze the Sashelp.Fish
data and create Figure 35.1 through Figure 35.5:
title 'Fish Measurement Data';

proc discrim data=sashelp.fish;
   class Species;
run;
The DISCRIM procedure begins by displaying summary information about the variables in the analysis (see Figure 35.1). This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement), and the number of classes in the classification variable (specified with the CLASS statement). The frequency of each class, its weight, the proportion of the total sample, and the prior probability are also displayed. Equal priors are assigned by default.
Figure 35.1: Summary Information
Class Level Information
Species  Variable Name  Frequency  Weight  Proportion  Prior Probability
Bream  Bream  34  34.0000  0.215190  0.142857 
Parkki  Parkki  11  11.0000  0.069620  0.142857 
Perch  Perch  56  56.0000  0.354430  0.142857 
Pike  Pike  17  17.0000  0.107595  0.142857 
Roach  Roach  20  20.0000  0.126582  0.142857 
Smelt  Smelt  14  14.0000  0.088608  0.142857 
Whitefish  Whitefish  6  6.0000  0.037975  0.142857 
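As an illustration of the statements mentioned above, the following sketch lists the quantitative variables explicitly in a VAR statement and replaces the default equal priors with priors proportional to the class frequencies. It is a variation on the example, not the analysis that produced the figures in this section; the variable names are assumed to match those shown in the linear discriminant function output.

```sas
/* Sketch: explicit VAR statement and proportional priors.
   With PRIORS PROPORTIONAL, the prior probability of each species
   equals its proportion of the sample instead of 1/7. */
title 'Fish Measurement Data';

proc discrim data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
   priors proportional;
run;
```

With this specification, the Prior Probability column in the class level table would be expected to match the Proportion column rather than showing 0.142857 for every species.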
The natural log of the determinant of the pooled covariance matrix is displayed in Figure 35.2.
Figure 35.2: Pooled Covariance Matrix Information
The squared distances between the classes are shown in Figure 35.3.
Figure 35.3: Squared Distances
Fish Measurement Data 
Generalized Squared Distance to Species  

From Species  Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish 
Bream  0  83.32523  243.66688  310.52333  133.06721  252.75503  132.05820 
Parkki  83.32523  0  57.09760  174.20918  27.00096  60.52076  26.54855 
Perch  243.66688  57.09760  0  101.06791  29.21632  29.26806  20.43791 
Pike  310.52333  174.20918  101.06791  0  92.40876  127.82177  99.90673 
Roach  133.06721  27.00096  29.21632  92.40876  0  33.84280  6.31997 
Smelt  252.75503  60.52076  29.26806  127.82177  33.84280  0  46.37326 
Whitefish  132.05820  26.54855  20.43791  99.90673  6.31997  46.37326  0 
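For reference, when the covariance matrices are pooled and the priors are equal, the generalized squared distance between two classes reduces to the squared Mahalanobis distance between their mean vectors. The notation here is introduced only for illustration: $\bar{x}_j$ denotes the mean vector of class $j$ and $S_p$ the pooled within-class covariance matrix.

```latex
D^2(i \mid j) = (\bar{x}_i - \bar{x}_j)^{\top} S_p^{-1} (\bar{x}_i - \bar{x}_j)
```

This form is symmetric in $i$ and $j$ and is zero when $i = j$, which matches the layout of Figure 35.3. In that table, Roach and Whitefish are the closest pair (6.31997), and Bream and Pike are the farthest apart (310.52333).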
The coefficients of the linear discriminant function are displayed (in Figure 35.4) with the default options METHOD=NORMAL and POOL=YES.
Figure 35.4: Linear Discriminant Function
Linear Discriminant Function for Species  

Variable  Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish 
Constant  -185.91682  -64.92517  -48.68009  -148.06402  -62.65963  -19.70401  -67.44603 
Weight  0.10912  0.09031  0.09418  0.13805  0.09901  0.05778  0.09948 
Length1  23.02273  13.64180  19.45368  20.92442  14.63635  4.09257  22.57117 
Length2  26.70692  5.38195  17.33061  6.19887  7.47195  3.63996  3.83450 
Length3  50.55780  20.89531  5.25993  22.94989  25.00702  10.60171  21.12638 
Height  13.91638  8.44567  1.42833  8.99687  0.26083  1.84569  0.64957 
Width  23.71895  13.38592  1.32749  9.13410  3.74542  3.43630  2.52442 
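As a sketch of how these coefficients arise under METHOD=NORMAL and POOL=YES with equal priors, each class $j$ has a linear score built from its mean vector and the pooled covariance matrix. The symbols $L_j$, $b_j$, $c_j$, $\bar{x}_j$, and $S_p$ are notation assumed here for illustration.

```latex
\begin{aligned}
L_j(x) &= c_j + b_j^{\top} x, \\
b_j &= S_p^{-1} \bar{x}_j, \\
c_j &= -\tfrac{1}{2}\, \bar{x}_j^{\top} S_p^{-1} \bar{x}_j .
\end{aligned}
```

An observation $x$ is assigned to the class with the largest score $L_j(x)$. This is equivalent to choosing the class with the smallest squared distance, because $(x - \bar{x}_j)^{\top} S_p^{-1} (x - \bar{x}_j) = x^{\top} S_p^{-1} x - 2 L_j(x)$ and the first term does not depend on $j$.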
A summary of how the discriminant function classifies the data used to develop the function is displayed last. In Figure 35.5, you see that only three of the observations are misclassified. The error-count estimates give the proportion of misclassified observations in each group. Because you are classifying the same data that are used to derive the discriminant function, these error-count estimates are biased: they tend to understate the error rate you would see on new data.
Figure 35.5: Resubstitution Misclassification Summary
Fish Measurement Data
Number of Observations and Percent Classified into Species
From Species  Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish  Total
[Table body not recovered. Rows: Bream, Parkki, Perch, Pike, Roach, Smelt, Whitefish, Total, Priors.]
One way to reduce the bias of the error-count estimates is to split your data into two sets. One set is used to derive the discriminant function, and the other set is used to run validation tests. Example 35.4 shows how to analyze a test data set. Another method of reducing bias is to classify each observation by using a discriminant function computed from all of the other observations (leave-one-out cross validation); this method is invoked with the CROSSVALIDATE option.
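Both approaches can be sketched as follows. This is an illustrative outline, not the code from Example 35.4: the data set names fish_split, train, and test, the 70% sampling rate, and the seed are arbitrary choices, and the split uses PROC SURVEYSELECT, which is one of several ways to partition a data set.

```sas
/* Approach 1 (sketch): hold out part of the data for validation.
   OUTALL keeps every observation and adds a 0/1 variable named Selected. */
proc surveyselect data=sashelp.fish out=fish_split outall
                  method=srs samprate=0.7 seed=12345;
run;

data train test;
   set fish_split;
   if Selected then output train;   /* used to derive the function */
   else output test;                /* used only for validation    */
   drop Selected;
run;

/* The VAR statement is listed explicitly so that only the six
   measurements enter the analysis. The test data are classified
   with the function derived from the training data. */
proc discrim data=train testdata=test;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

/* Approach 2 (sketch): leave-one-out cross validation on all the data. */
proc discrim data=sashelp.fish crossvalidate;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Because some species have very few observations (Whitefish has only six), a simple random split can leave a class thinly represented in either set, so cross validation may be the more practical choice for data of this size.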