The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmavesi; this data set is available from
Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and
width of each fish are recorded. Three different length measurements are taken: from the nose of the fish to the beginning
of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded
as percentages of the third length variable. The fish data set is available from the Sashelp
library. The goal is to find a discriminant function based on these six variables that best classifies the fish into
species.
First, assume that the data are normally distributed within each group with equal covariances across groups. The following
statements use PROC DISCRIM to analyze the Sashelp.Fish
data and create Figure 33.1 through Figure 33.5:
   title 'Fish Measurement Data';
   proc discrim data=sashelp.fish;
      class Species;
   run;
The DISCRIM procedure begins by displaying summary information about the variables in the analysis (see Figure 33.1). This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement), and the number of classes in the classification variable (specified with the CLASS statement). The frequency of each class, its weight, the proportion of the total sample, and the prior probability are also displayed. Equal priors are assigned by default.
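The step above contains no VAR statement; in that case PROC DISCRIM analyzes all numeric variables not named in other statements. As a sketch, the same analysis can be requested with the defaults written out. The variable names are taken from the rows of Figure 33.4, and the comments reflect an assumption that nothing else in the session changes the defaults:

```sas
/* Same analysis with the default settings spelled out explicitly */
title 'Fish Measurement Data';
proc discrim data=sashelp.fish method=normal pool=yes;
   class Species;
   var Weight Length1 Length2 Length3 Height Width; /* the six quantitative variables */
   priors equal;   /* default; PRIORS PROPORTIONAL would use the class proportions instead */
run;
```

With PRIORS PROPORTIONAL, the Prior Probability column would match the Proportion column instead of showing 1/7 = 0.142857 for every species.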
Figure 33.1: Summary Information
Fish Measurement Data 
Total Sample Size  158  DF Total  157 
Variables  6  DF Within Classes  151 
Classes  7  DF Between Classes  6 
Number of Observations Read  159 
Number of Observations Used  158 
Class Level Information 
Species  Variable Name  Frequency  Weight  Proportion  Prior Probability 
Bream  Bream  34  34.0000  0.215190  0.142857 
Parkki  Parkki  11  11.0000  0.069620  0.142857 
Perch  Perch  56  56.0000  0.354430  0.142857 
Pike  Pike  17  17.0000  0.107595  0.142857 
Roach  Roach  20  20.0000  0.126582  0.142857 
Smelt  Smelt  14  14.0000  0.088608  0.142857 
Whitefish  Whitefish  6  6.0000  0.037975  0.142857 
The natural log of the determinant of the pooled covariance matrix is displayed in Figure 33.2.
Figure 33.2: Pooled Covariance Matrix Information
Pooled Covariance Matrix Information 
Covariance Matrix Rank  Natural Log of the Determinant of the Covariance Matrix 
6  4.17613 
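For reference, the pooled covariance matrix is conventionally the degrees-of-freedom-weighted average of the within-class covariance matrices. The notation here is introduced for illustration and does not come from the output: $K$ is the number of classes, $n_k$ and $S_k$ are the number of observations and the sample covariance matrix in class $k$, and $n$ is the total sample size.

```latex
S_p = \frac{1}{n - K} \sum_{k=1}^{K} (n_k - 1)\, S_k ,
\qquad n - K = 158 - 7 = 151
```

The divisor matches the DF Within Classes value in Figure 33.1, and Figure 33.2 reports $\ln \lvert S_p \rvert$.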
The squared distances between the classes are shown in Figure 33.3.
Figure 33.3: Squared Distances
Fish Measurement Data 
Generalized Squared Distance to Species  

From Species  Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish 
Bream  0  83.32523  243.66688  310.52333  133.06721  252.75503  132.05820 
Parkki  83.32523  0  57.09760  174.20918  27.00096  60.52076  26.54855 
Perch  243.66688  57.09760  0  101.06791  29.21632  29.26806  20.43791 
Pike  310.52333  174.20918  101.06791  0  92.40876  127.82177  99.90673 
Roach  133.06721  27.00096  29.21632  92.40876  0  33.84280  6.31997 
Smelt  252.75503  60.52076  29.26806  127.82177  33.84280  0  46.37326 
Whitefish  132.05820  26.54855  20.43791  99.90673  6.31997  46.37326  0 
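With a pooled covariance matrix and equal priors, these entries are squared Mahalanobis distances between class mean vectors. In the notation assumed here, $\bar{x}_i$ is the mean vector of class $i$ and $S_p$ is the pooled covariance matrix:

```latex
D^2(i \mid j) = (\bar{x}_i - \bar{x}_j)^{\top} S_p^{-1} (\bar{x}_i - \bar{x}_j)
```

The table is therefore symmetric with zeros on the diagonal. Roach and Whitefish are the closest pair (6.31997), and Bream and Pike are the farthest apart (310.52333).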
The coefficients of the linear discriminant function are displayed (in Figure 33.4) with the default options METHOD=NORMAL and POOL=YES.
Figure 33.4: Linear Discriminant Function
Linear Discriminant Function for Species  

Variable  Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish 
Constant  -185.91682  -64.92517  -48.68009  -148.06402  -62.65963  -19.70401  -67.44603 
Weight  0.10912  0.09031  0.09418  0.13805  0.09901  0.05778  0.09948 
Length1  23.02273  13.64180  19.45368  20.92442  14.63635  4.09257  22.57117 
Length2  26.70692  5.38195  17.33061  6.19887  7.47195  3.63996  3.83450 
Length3  50.55780  20.89531  5.25993  22.94989  25.00702  10.60171  21.12638 
Height  13.91638  8.44567  1.42833  8.99687  0.26083  1.84569  0.64957 
Width  23.71895  13.38592  1.32749  9.13410  3.74542  3.43630  2.52442 
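Each column of Figure 33.4 defines one linear score per species, and an observation is assigned to the species with the largest score. A sketch of the standard form under equal priors, using notation introduced here ($\bar{x}_j$ for the class-$j$ mean vector, $S_p$ for the pooled covariance matrix):

```latex
L_j(x) = c_j + b_j^{\top} x, \qquad
c_j = -\tfrac{1}{2}\, \bar{x}_j^{\top} S_p^{-1} \bar{x}_j, \qquad
b_j = S_p^{-1} \bar{x}_j
```

Here $c_j$ corresponds to the Constant row and $b_j$ to the six variable rows. With unequal priors $q_j$, the term $\ln q_j$ is added to the constant.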
The last table summarizes how the discriminant function classifies the same data that were used to develop it (resubstitution). Figure 33.5 shows that only three observations are misclassified. The error-count estimates give the proportion of misclassified observations in each group. Because the function is evaluated on the data used to derive it, these error-count estimates are optimistically biased.
Figure 33.5: Resubstitution Misclassification Summary
Fish Measurement Data 
Number of Observations and Percent Classified into Species 
From Species  Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish  Total 
[Table body not preserved: rows for Bream, Parkki, Perch, Pike, Roach, Smelt, Whitefish, Total, and Priors]
Error Count Estimates for Species  

Bream  Parkki  Perch  Pike  Roach  Smelt  Whitefish  Total  
Rate  0.0000  0.0000  0.0536  0.0000  0.0000  0.0000  0.0000  0.0077 
Priors  0.1429  0.1429  0.1429  0.1429  0.1429  0.1429  0.1429 
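The Total entry is the prior-weighted average of the per-species rates, not the raw fraction of misclassified fish. A quick check, using the three misclassified perch out of 56 and equal priors $q_k = 1/7$:

```latex
\text{Rate}_{\text{Perch}} = \frac{3}{56} \approx 0.0536, \qquad
\text{Total} = \sum_{k=1}^{7} q_k \, \text{Rate}_k
             = \frac{1}{7} \cdot \frac{3}{56} \approx 0.0077
```

By contrast, the unweighted fraction of misclassified observations would be 3/158, or about 0.0190.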
One way to reduce the bias of the error-count estimates is to split your data into two sets: one set is used to derive the discriminant function, and the other is used to validate it. Example 33.4 shows how to analyze a test data set. Another way to reduce bias is to classify each observation by using a discriminant function computed from all of the other observations; this method is invoked with the CROSSVALIDATE option.
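Both approaches can be sketched as follows. The data set names FishSplit, FishTrain, and FishTest, the 70% sampling rate, and the seed are hypothetical choices made for illustration; Example 33.4 may set up its test data differently.

```sas
/* Cross validation: each fish is classified by a function
   computed from all of the other fish */
proc discrim data=sashelp.fish crossvalidate;
   class Species;
run;

/* Holdout alternative: flag a random 70% of the fish for training.
   OUTALL keeps every observation and adds the 0/1 variable Selected. */
proc surveyselect data=sashelp.fish out=FishSplit
                  method=srs samprate=0.7 seed=12345 outall;
run;

data FishTrain FishTest;
   set FishSplit;
   if Selected then output FishTrain;
   else output FishTest;
run;

/* Derive the function on FishTrain, then classify FishTest.
   The VAR statement keeps Selected out of the analysis. */
proc discrim data=FishTrain testdata=FishTest testlisterr;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
   testclass Species;
run;
```

With only six whitefish in the data, a random split can leave very few of them in either set, so the cross validation estimates may be the more stable choice for the small classes.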