The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmaevesi; this data set is available
from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt) the weight, length, height, and
width of each fish are tallied. Three different length measurements are recorded: from the nose of the fish to the beginning
of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded
as percentages of the third length variable. The fish data set is available from the Sashelp
library. PROC STEPDISC will select a subset of the six quantitative variables that might be useful for differentiating between
the fish species. This subset is used in conjunction with PROC CANDISC and PROC DISCRIM to develop discrimination models.
The following steps use PROC STEPDISC to select a subset of potential discriminator variables. By default, PROC STEPDISC uses stepwise selection on all numeric variables that are not listed in other statements, and the significance levels for a variable to enter the subset and to stay in the subset are set to 0.15. The following statements produce Figure 108.1 through Figure 108.5:
    title 'Fish Measurement Data';
    proc stepdisc data=sashelp.fish;
       class Species;
    run;
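For readers who prefer to see the defaults spelled out, the following is a minimal sketch of an equivalent call. It assumes that the six quantitative variables are the ones named in the output tables (Weight, Length1, Length2, Length3, Height, and Width); the explicit VAR statement and the explicit option values are additions for illustration and are not part of the original example.

```sas
/* Same analysis as above, with the documented defaults written out.  */
/* Assumption: the VAR list names the six quantitative variables that */
/* appear in the output tables of this example.                       */
title 'Fish Measurement Data';
proc stepdisc data=sashelp.fish
              method=stepwise   /* default selection method           */
              slentry=0.15      /* significance level to enter        */
              slstay=0.15;      /* significance level to stay         */
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Writing the options explicitly makes it easier to see what to change when a stricter entry or stay criterion is wanted.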
PROC STEPDISC begins by displaying summary information about the analysis (see Figure 108.1). This information includes the number of observations with nonmissing values, the number of classes in the classification variable (specified by the CLASS statement), the number of quantitative variables under consideration, the significance criteria for variables to enter and to stay in the model, and the method of variable selection being used. The frequency of each class is also displayed.
Figure 108.1: Summary Information
Class Level Information

Species | Variable Name | Frequency | Weight | Proportion |
---|---|---|---|---|
Bream | Bream | 34 | 34.0000 | 0.215190 |
Parkki | Parkki | 11 | 11.0000 | 0.069620 |
Perch | Perch | 56 | 56.0000 | 0.354430 |
Pike | Pike | 17 | 17.0000 | 0.107595 |
Roach | Roach | 20 | 20.0000 | 0.126582 |
Smelt | Smelt | 14 | 14.0000 | 0.088608 |
Whitefish | Whitefish | 6 | 6.0000 | 0.037975 |
For each entry step, the statistics for entry are displayed for all variables not currently selected (see Figure 108.2). The variable selected to enter at this step (if any) is displayed, as well as all the variables currently selected. Next are multivariate statistics that take into account all previously selected variables and the newly entered variable.
Figure 108.2: Step 1: Variable HEIGHT Selected for Entry
Fish Measurement Data

Statistics for Entry, DF = 6, 151

Variable | R-Square | F Value | Pr > F | Tolerance |
---|---|---|---|---|
Weight | 0.3750 | 15.10 | <.0001 | 1.0000 |
Length1 | 0.6017 | 38.02 | <.0001 | 1.0000 |
Length2 | 0.6098 | 39.32 | <.0001 | 1.0000 |
Length3 | 0.6280 | 42.49 | <.0001 | 1.0000 |
Height | 0.7553 | 77.69 | <.0001 | 1.0000 |
Width | 0.4806 | 23.29 | <.0001 | 1.0000 |
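As a rough check on the entry statistics, note that no variables have been selected before step 1, so each F value should be recoverable from its R-Square through the usual one-way analysis-of-variance relation. The following is a sketch under that assumption. The symbols $R^2$, $\mathrm{df}_1$, and $\mathrm{df}_2$ are notation introduced here; the degrees of freedom come from the table title and are consistent with 7 species and the 158 observations counted in the class frequencies.

$$
F = \frac{R^2}{1 - R^2}\cdot\frac{\mathrm{df}_2}{\mathrm{df}_1},
\qquad
\mathrm{df}_1 = 7 - 1 = 6,
\qquad
\mathrm{df}_2 = 158 - 7 = 151
$$

$$
F_{\text{Height}} \approx \frac{0.7553}{1 - 0.7553}\cdot\frac{151}{6} \approx 77.68
$$

This agrees with the displayed value of 77.69 up to rounding of the four-digit R-Square.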
For each removal step (Figure 108.3), the statistics for removal are displayed for all variables currently entered. The variable to be removed at this step (if any) is displayed. If no variable meets the criterion to be removed and the maximum number of steps as specified by the MAXSTEP= option has not been attained, then the procedure continues with another entry step.
Figure 108.3: Step 2: No Variable Is Removed; Variable Length2 Added
The stepwise procedure terminates either when no variable can be removed and no variable can be entered or when the maximum number of steps as specified by the MAXSTEP= option has been attained. In this example, no variable can be removed or entered at step 7 (Figure 108.4). Steps 3 through 6 are not displayed in this document.
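The MAXSTEP= option mentioned above can be illustrated with a short sketch. The value 3 is arbitrary and chosen only for illustration; it is not used in the original example.

```sas
/* Stop the stepwise search after at most three steps.                */
/* The value 3 is an arbitrary, illustrative choice.                  */
proc stepdisc data=sashelp.fish maxstep=3;
   class Species;
run;
```

With a limit in place, the procedure can stop before the entry and stay criteria are exhausted, so the selected subset might be smaller than the one reported in this example.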
Figure 108.4: Step 7: No Variables Entered or Removed
Fish Measurement Data

Statistics for Removal, DF = 6, 146

Variable | Partial R-Square | F Value | Pr > F |
---|---|---|---|
Weight | 0.4521 | 20.08 | <.0001 |
Length1 | 0.2987 | 10.36 | <.0001 |
Length2 | 0.5250 | 26.89 | <.0001 |
Length3 | 0.7948 | 94.25 | <.0001 |
Height | 0.7257 | 64.37 | <.0001 |
Width | 0.5757 | 33.02 | <.0001 |
PROC STEPDISC ends by displaying a summary of the steps.
Figure 108.5: Step Summary
Fish Measurement Data

Stepwise Selection Summary

Step | Number In | Entered | Removed | Partial R-Square | F Value | Pr > F | Wilks' Lambda | Pr < Lambda | Average Squared Canonical Correlation | Pr > ASCC |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Height |  | 0.7553 | 77.69 | <.0001 | 0.24466983 | <.0001 | 0.12588836 | <.0001 |
2 | 2 | Length2 |  | 0.9229 | 299.31 | <.0001 | 0.01886065 | <.0001 | 0.25905822 | <.0001 |
3 | 3 | Length3 |  | 0.8826 | 186.77 | <.0001 | 0.00221342 | <.0001 | 0.38427100 | <.0001 |
4 | 4 | Width |  | 0.5775 | 33.72 | <.0001 | 0.00093510 | <.0001 | 0.45200732 | <.0001 |
5 | 5 | Weight |  | 0.4461 | 19.73 | <.0001 | 0.00051794 | <.0001 | 0.49488458 | <.0001 |
6 | 6 | Length1 |  | 0.2987 | 10.36 | <.0001 | 0.00036325 | <.0001 | 0.51744189 | <.0001 |
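The Wilks' Lambda column in the step summary can be related to the Partial R-Square column: each time a variable enters, lambda is multiplied by one minus that variable's partial R-square. The following sketch uses the rounded values from the summary table; $\Lambda_k$ and $R_k^2$ (the lambda after step $k$ and the partial R-square of the variable entered at step $k$) are notation introduced here.

$$
\Lambda_k = \Lambda_{k-1}\,\bigl(1 - R_k^2\bigr), \qquad \Lambda_0 = 1
$$

$$
\Lambda_1 \approx 1 - 0.7553 = 0.2447,
\qquad
\Lambda_2 \approx 0.24466983 \times (1 - 0.9229) \approx 0.01886
$$

Both values match the table to the precision that the rounded partial R-squares allow. The step 1 value of the average squared canonical correlation is likewise consistent with dividing the single squared canonical correlation by the number of classes minus one: $0.7553 / 6 \approx 0.1259$.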
All six quantitative variables in the data set are found to have potential discriminatory power. These variables are used to develop discrimination models in both the CANDISC and DISCRIM procedure chapters.
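As a minimal sketch of that follow-on use, the selected variables can be passed to PROC CANDISC and PROC DISCRIM in a VAR statement. This is an illustration only and does not reproduce the options used in those chapters.

```sas
/* Sketch: pass all six selected variables to the follow-on           */
/* procedures. No procedure options from the CANDISC or DISCRIM       */
/* chapters are reproduced here.                                      */
proc candisc data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

proc discrim data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Had PROC STEPDISC dropped any variables, only the retained ones would be listed in the VAR statements.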