The data in this example are measurements of 159 fish caught in Finland's Lake Laengelmavesi; this data set is available from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and width of each fish are recorded. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The fish data set is also available from the Sashelp library. PROC STEPDISC selects a subset of the six quantitative variables that might be useful for differentiating between the fish species. This subset is then used with PROC CANDISC and PROC DISCRIM to develop discrimination models.
The following steps use PROC STEPDISC to select a subset of potential discriminator variables. By default, PROC STEPDISC uses stepwise selection on all numeric variables that are not listed in other statements, and the significance levels for a variable to enter the subset and to stay in the subset are set to 0.15. The following statements produce Figure 85.1 through Figure 85.5:
title 'Fish Measurement Data';
proc stepdisc data=sashelp.fish;
   class Species;
run;
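The defaults described above can also be written out explicitly, which makes the selection settings visible in the code. The following is a sketch that should be equivalent to the default call; the VAR list is an assumption based on the six variable names that appear in the output, and METHOD=, SLENTRY=, and SLSTAY= are the PROC STEPDISC options that correspond to the method and significance levels named in the text.

```sas
/* Sketch: same analysis with the default settings spelled out */
title 'Fish Measurement Data';
proc stepdisc data=sashelp.fish
              method=stepwise   /* stepwise selection (the default)     */
              slentry=0.15      /* significance level to enter          */
              slstay=0.15;      /* significance level to stay           */
   class Species;
   /* Assumed to be the six numeric variables picked up by default */
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Listing the variables in a VAR statement restricts the search to those variables, so it is also a convenient way to exclude a candidate.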
PROC STEPDISC begins by displaying summary information about the analysis (see Figure 85.1). This information includes the number of observations with nonmissing values, the number of classes in the classification variable (specified by the CLASS statement), the number of quantitative variables under consideration, the significance criteria for variables to enter and to stay in the model, and the method of variable selection being used. The frequency of each class is also displayed. In this example, 158 of the 159 observations read are used in the analysis.
Fish Measurement Data

The Method for Selecting Variables is STEPWISE

Total Sample Size    158    Variable(s) in the Analysis       6
Class Levels           7    Variable(s) Will Be Included      0
                            Significance Level to Enter    0.15
                            Significance Level to Stay     0.15

Number of Observations Read    159
Number of Observations Used    158

Class Level Information

Species      Variable Name    Frequency     Weight    Proportion
Bream        Bream                   34    34.0000      0.215190
Parkki       Parkki                  11    11.0000      0.069620
Perch        Perch                   56    56.0000      0.354430
Pike         Pike                    17    17.0000      0.107595
Roach        Roach                   20    20.0000      0.126582
Smelt        Smelt                   14    14.0000      0.088608
Whitefish    Whitefish                6     6.0000      0.037975
For each entry step, the statistics for entry are displayed for all variables not currently selected (see Figure 85.2). The variable selected to enter at this step (if any) is displayed, as well as all the variables currently selected. Next are multivariate statistics that take into account all previously selected variables and the newly entered variable.
Fish Measurement Data

Statistics for Entry, DF = 6, 151

Variable    R-Square    F Value    Pr > F    Tolerance
Weight        0.3750      15.10    <.0001       1.0000
Length1       0.6017      38.02    <.0001       1.0000
Length2       0.6098      39.32    <.0001       1.0000
Length3       0.6280      42.49    <.0001       1.0000
Height        0.7553      77.69    <.0001       1.0000
Width         0.4806      23.29    <.0001       1.0000

Variable Height will be entered.

Variable(s) That Have Been Entered
Height

Multivariate Statistics

Statistic                                   Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                            0.244670      77.69         6       151    <.0001
Pillai's Trace                           0.755330      77.69         6       151    <.0001
Average Squared Canonical Correlation    0.125888
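The numbers in this first entry step are tied together by simple identities, which makes the output easy to check by hand. The following is a sketch based on standard one-way ANOVA and MANOVA relations, not a description of the procedure's internal computations. The symbols are notation introduced here: $n = 158$ is the number of observations used, $c = 7$ is the number of class levels, and $R^2$ is the R-square for Height.

```latex
\begin{align*}
F &= \frac{R^2}{1 - R^2} \cdot \frac{n - c}{c - 1}
   = \frac{0.7553}{0.2447} \cdot \frac{151}{6} \approx 77.7, \\
\Lambda &= 1 - R^2 = 1 - 0.755330 = 0.244670
   \quad \text{(Wilks' lambda with one variable)}, \\
\text{Pillai's trace} &= R^2 = 0.755330
   \quad \text{(with one variable)}, \\
\text{ASCC} &= \frac{\text{Pillai's trace}}{c - 1}
   = \frac{0.755330}{6} \approx 0.125888 .
\end{align*}
```

The degrees of freedom in the table heading, 6 and 151, are $c - 1$ and $n - c$.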
For each removal step (Figure 85.3), the statistics for removal are displayed for all variables currently entered. The variable to be removed at this step (if any) is displayed. If no variable meets the criterion to be removed and the maximum number of steps as specified by the MAXSTEP= option has not been attained, then the procedure continues with another entry step.
Fish Measurement Data

Statistics for Removal, DF = 6, 151

Variable    R-Square    F Value    Pr > F
Height        0.7553      77.69    <.0001

No variables can be removed.

Statistics for Entry, DF = 6, 150

Variable    Partial R-Square    F Value    Pr > F    Tolerance
Weight                0.7388      70.71    <.0001       0.4690
Length1               0.9220     295.35    <.0001       0.6083
Length2               0.9229     299.31    <.0001       0.5892
Length3               0.9173     277.37    <.0001       0.5056
Width                 0.8783     180.44    <.0001       0.3699

Variable Length2 will be entered.

Variable(s) That Have Been Entered
Length2  Height

Multivariate Statistics

Statistic                                   Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                            0.018861     157.04        12       300    <.0001
Pillai's Trace                           1.554349      87.78        12       302    <.0001
Average Squared Canonical Correlation    0.259058
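The MAXSTEP= limit mentioned in the description of removal steps can be set on the PROC STEPDISC statement. The following is a minimal sketch; the value 2 is chosen only for illustration and is not part of the original example.

```sas
/* Sketch: cap the stepwise search at two steps (illustrative value) */
proc stepdisc data=sashelp.fish maxstep=2;
   class Species;
run;
```

Given the entry steps shown in this example, a run like this would be expected to stop once Height and Length2 have entered, although that outcome is an inference from the output shown here rather than a result reported in the source.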
The stepwise procedure terminates either when no variable can be removed and no variable can be entered or when the maximum number of steps as specified by the MAXSTEP= option has been attained. In this example at step 7 no variables can be either removed or entered (Figure 85.4). Steps 3 through 6 are not displayed in this document.
Fish Measurement Data

Statistics for Removal, DF = 6, 146

Variable    Partial R-Square    F Value    Pr > F
Weight                0.4521      20.08    <.0001
Length1               0.2987      10.36    <.0001
Length2               0.5250      26.89    <.0001
Length3               0.7948      94.25    <.0001
Height                0.7257      64.37    <.0001
Width                 0.5757      33.02    <.0001

No variables can be removed.
No further steps are possible.

PROC STEPDISC ends by displaying a summary of the steps (see Figure 85.5).
Fish Measurement Data

Stepwise Selection Summary

                                          Partial                            Wilks'      Pr <         Average Squared      Pr >
Step  Number In  Entered  Removed        R-Square  F Value  Pr > F          Lambda    Lambda   Canonical Correlation      ASCC
   1          1  Height                    0.7553    77.69  <.0001      0.24466983    <.0001              0.12588836    <.0001
   2          2  Length2                   0.9229   299.31  <.0001      0.01886065    <.0001              0.25905822    <.0001
   3          3  Length3                   0.8826   186.77  <.0001      0.00221342    <.0001              0.38427100    <.0001
   4          4  Width                     0.5775    33.72  <.0001      0.00093510    <.0001              0.45200732    <.0001
   5          5  Weight                    0.4461    19.73  <.0001      0.00051794    <.0001              0.49488458    <.0001
   6          6  Length1                   0.2987    10.36  <.0001      0.00036325    <.0001              0.51744189    <.0001
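The columns of the summary can be cross-checked against one another. The following sketch uses the standard relation between a partial R-square and Wilks' lambda; it is offered as a reading aid and is not taken from the procedure's documentation of its computations. The notation is introduced here: $\Lambda_k$ is Wilks' lambda after step $k$ (with $\Lambda_0 = 1$), $R^2_k$ is the partial R-square of the variable entered at step $k$, $n = 158$, and $c = 7$.

```latex
\begin{align*}
\Lambda_k &= \Lambda_{k-1}\,\bigl(1 - R^2_k\bigr), \\
\Lambda_2 &= 0.24466983 \times (1 - 0.9229) \approx 0.01886, \\
\Lambda_3 &= 0.01886065 \times (1 - 0.8826) \approx 0.00221, \\[4pt]
F_k &= \frac{R^2_k}{1 - R^2_k} \cdot \frac{n - c - (k - 1)}{c - 1},
\qquad
F_2 = \frac{0.9229}{0.0771} \cdot \frac{150}{6} \approx 299 .
\end{align*}
```

Small differences from the listed values (for example, 299.31 for $F_2$) come from using the partial R-square rounded to four decimal places.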
All six quantitative variables in the data set are found to have potential discriminatory power. These variables are used to develop discrimination models in both the CANDISC and DISCRIM procedure chapters.
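As a rough illustration of how the selected subset carries over, the six variables can be listed in the VAR statement of the follow-on procedures. This is a sketch with default options only; the statements actually used in the CANDISC and DISCRIM chapters are not shown in this section and may differ.

```sas
/* Sketch only: pass the six selected variables to the follow-on procedures */
proc candisc data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

proc discrim data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```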