The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmavesi; this data set is available from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and width of each fish are recorded. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The fish data set is available from the Sashelp library. In this example, PROC STEPDISC selects a subset of the six quantitative variables that might be useful for differentiating between the fish species. This subset is then used in conjunction with PROC CANDISC and PROC DISCRIM to develop discrimination models.
The following steps use PROC STEPDISC to select a subset of potential discriminator variables. By default, PROC STEPDISC uses stepwise selection on all numeric variables that are not listed in other statements, and the significance levels for a variable to enter the subset and to stay in the subset are set to 0.15. The following statements produce Figure 85.1 through Figure 85.5:
    title 'Fish Measurement Data';
    proc stepdisc data=sashelp.fish;
       class Species;
    run;
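The defaults described above can also be written out explicitly. The following is a minimal sketch, assuming the PROC STEPDISC options METHOD=, SLENTRY=, and SLSTAY= and a VAR statement that lists the six quantitative variables in Sashelp.Fish; it is intended to request the same analysis as the shorter program, not to replace it.

```sas
/* Sketch: the same stepwise analysis with the defaults spelled out. */
title 'Fish Measurement Data';
proc stepdisc data=sashelp.fish
              method=stepwise   /* default selection method        */
              slentry=0.15      /* significance level to enter     */
              slstay=0.15;      /* significance level to stay      */
   class Species;               /* classification variable         */
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Listing the candidates in a VAR statement makes the set of variables under consideration explicit instead of relying on the rule that all numeric variables not listed in other statements are used.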
PROC STEPDISC begins by displaying summary information about the analysis (see Figure 85.1). This information includes the number of observations with nonmissing values, the number of classes in the classification variable (specified by the CLASS statement), the number of quantitative variables under consideration, the significance criteria for variables to enter and to stay in the model, and the method of variable selection being used. The frequency of each class is also displayed.
Fish Measurement Data |
The Method for Selecting Variables is STEPWISE | |||
---|---|---|---|
Total Sample Size | 158 | Variable(s) in the Analysis | 6 |
Class Levels | 7 | Variable(s) Will Be Included | 0 |
Significance Level to Enter | 0.15 | ||
Significance Level to Stay | 0.15 |
Number of Observations Read | 159 |
---|---|
Number of Observations Used | 158 |
Class Level Information | ||||
---|---|---|---|---|
Species | Variable Name | Frequency | Weight | Proportion |
Bream | Bream | 34 | 34.0000 | 0.215190 |
Parkki | Parkki | 11 | 11.0000 | 0.069620 |
Perch | Perch | 56 | 56.0000 | 0.354430 |
Pike | Pike | 17 | 17.0000 | 0.107595 |
Roach | Roach | 20 | 20.0000 | 0.126582 |
Smelt | Smelt | 14 | 14.0000 | 0.088608 |
Whitefish | Whitefish | 6 | 6.0000 | 0.037975 |
For each entry step, the statistics for entry are displayed for all variables not currently selected (see Figure 85.2). The variable selected to enter at this step (if any) is displayed, as well as all the variables currently selected. Next are multivariate statistics that take into account all previously selected variables and the newly entered variable.
Fish Measurement Data |
Statistics for Entry, DF = 6, 151 | ||||
---|---|---|---|---|
Variable | R-Square | F Value | Pr > F | Tolerance |
Weight | 0.3750 | 15.10 | <.0001 | 1.0000 |
Length1 | 0.6017 | 38.02 | <.0001 | 1.0000 |
Length2 | 0.6098 | 39.32 | <.0001 | 1.0000 |
Length3 | 0.6280 | 42.49 | <.0001 | 1.0000 |
Height | 0.7553 | 77.69 | <.0001 | 1.0000 |
Width | 0.4806 | 23.29 | <.0001 | 1.0000 |
Variable Height will be entered. |
Variable(s) That Have Been Entered |
---|
Height |
Multivariate Statistics | |||||
---|---|---|---|---|---|
Statistic | Value | F Value | Num DF | Den DF | Pr > F |
Wilks' Lambda | 0.244670 | 77.69 | 6 | 151 | <.0001 |
Pillai's Trace | 0.755330 | 77.69 | 6 | 151 | <.0001 |
Average Squared Canonical Correlation | 0.125888 |
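With a single variable in the model, the multivariate statistics in Figure 85.2 reduce to simple functions of the univariate R-square for Height. The following worked check is a sketch that uses only the displayed values; the symbols $R^2$, $g$ (number of classes, 7), and $n$ (observations used, 158) are notation introduced here and do not appear in the procedure output.

```latex
\begin{align*}
F &= \frac{R^2/(g-1)}{(1-R^2)/(n-g)}
   = \frac{0.755330/6}{0.244670/151} \approx 77.69 \\
\Lambda_{\text{Wilks}} &= 1 - R^2 = 1 - 0.755330 = 0.244670 \\
V_{\text{Pillai}} &= R^2 = 0.755330 \\
\text{ASCC} &= \frac{V_{\text{Pillai}}}{g-1} = \frac{0.755330}{6} \approx 0.125888
\end{align*}
```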
For each removal step (Figure 85.3), the statistics for removal are displayed for all variables currently entered. The variable to be removed at this step (if any) is displayed. If no variable meets the criterion to be removed and the maximum number of steps as specified by the MAXSTEP= option has not been attained, then the procedure continues with another entry step.
Fish Measurement Data |
Statistics for Removal, DF = 6, 151 | |||
---|---|---|---|
Variable | R-Square | F Value | Pr > F |
Height | 0.7553 | 77.69 | <.0001 |
No variables can be removed. |
Statistics for Entry, DF = 6, 150 | ||||
---|---|---|---|---|
Variable | Partial R-Square | F Value | Pr > F | Tolerance |
Weight | 0.7388 | 70.71 | <.0001 | 0.4690 |
Length1 | 0.9220 | 295.35 | <.0001 | 0.6083 |
Length2 | 0.9229 | 299.31 | <.0001 | 0.5892 |
Length3 | 0.9173 | 277.37 | <.0001 | 0.5056 |
Width | 0.8783 | 180.44 | <.0001 | 0.3699 |
Variable Length2 will be entered. |
Variable(s) That Have Been Entered | |
---|---|
Length2 | Height |
Multivariate Statistics | |||||
---|---|---|---|---|---|
Statistic | Value | F Value | Num DF | Den DF | Pr > F |
Wilks' Lambda | 0.018861 | 157.04 | 12 | 300 | <.0001 |
Pillai's Trace | 1.554349 | 87.78 | 12 | 302 | <.0001 |
Average Squared Canonical Correlation | 0.259058 |
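The step 2 entry statistic for Length2 can be checked against its partial R-square in the same spirit. This is a sketch under the assumption that the partial F statistic has the usual form with the degrees of freedom shown in the table heading (6 and 150); the small difference from the displayed 299.31 comes from using the partial R-square rounded to four decimals. The average squared canonical correlation is again Pillai's trace divided by the number of classes minus 1.

```latex
\begin{align*}
F_{\text{Length2}} &= \frac{R^2_{\text{partial}}/6}{(1-R^2_{\text{partial}})/150}
  = \frac{0.9229/6}{0.0771/150} \approx 299.3 \\
\text{ASCC} &= \frac{V_{\text{Pillai}}}{6} = \frac{1.554349}{6} \approx 0.259058
\end{align*}
```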
The stepwise procedure terminates either when no variable can be removed and no variable can be entered or when the maximum number of steps, as specified by the MAXSTEP= option, has been attained. In this example, no variables can be either removed or entered at step 7 (Figure 85.4). Steps 3 through 6 are not displayed in this document.
Fish Measurement Data |
Statistics for Removal, DF = 6, 146 | |||
---|---|---|---|
Variable | Partial R-Square | F Value | Pr > F |
Weight | 0.4521 | 20.08 | <.0001 |
Length1 | 0.2987 | 10.36 | <.0001 |
Length2 | 0.5250 | 26.89 | <.0001 |
Length3 | 0.7948 | 94.25 | <.0001 |
Height | 0.7257 | 64.37 | <.0001 |
Width | 0.5757 | 33.02 | <.0001 |
No variables can be removed. |
PROC STEPDISC ends by displaying a summary of the steps (see Figure 85.5).
No further steps are possible. |
Fish Measurement Data |
Stepwise Selection Summary | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Step | Number In | Entered | Removed | Partial R-Square | F Value | Pr > F | Wilks' Lambda | Pr < Lambda | Average Squared Canonical Correlation | Pr > ASCC |
1 | 1 | Height | | 0.7553 | 77.69 | <.0001 | 0.24466983 | <.0001 | 0.12588836 | <.0001 |
2 | 2 | Length2 | | 0.9229 | 299.31 | <.0001 | 0.01886065 | <.0001 | 0.25905822 | <.0001 |
3 | 3 | Length3 | | 0.8826 | 186.77 | <.0001 | 0.00221342 | <.0001 | 0.38427100 | <.0001 |
4 | 4 | Width | | 0.5775 | 33.72 | <.0001 | 0.00093510 | <.0001 | 0.45200732 | <.0001 |
5 | 5 | Weight | | 0.4461 | 19.73 | <.0001 | 0.00051794 | <.0001 | 0.49488458 | <.0001 |
6 | 6 | Length1 | | 0.2987 | 10.36 | <.0001 | 0.00036325 | <.0001 | 0.51744189 | <.0001 |
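The summary columns are related to one another: when a variable enters, Wilks' lambda is approximately the previous lambda multiplied by one minus the partial R-square of the entering variable. The following DATA step is a sketch that checks this relation against the values in the summary table; the data set name lambda_check and the variable names are hypothetical, and small differences are expected because the partial R-square values are rounded to four decimals.

```sas
/* Sketch: compare each reported Wilks' lambda with                   */
/* (previous lambda) * (1 - partial R-square) from the summary table. */
data lambda_check;
   input Step Entered :$8. PartialRSq Lambda;
   retain PrevLambda 1;                     /* lambda before any variable enters */
   Predicted = PrevLambda * (1 - PartialRSq);
   PrevLambda = Lambda;                     /* carry the reported value forward  */
   datalines;
1 Height  0.7553 0.24466983
2 Length2 0.9229 0.01886065
3 Length3 0.8826 0.00221342
4 Width   0.5775 0.00093510
5 Weight  0.4461 0.00051794
6 Length1 0.2987 0.00036325
;
run;

proc print data=lambda_check noobs;
   var Step Entered PartialRSq Lambda Predicted;
   format Lambda Predicted 10.8;
run;
```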
All six quantitative variables in the data set enter the model and none is removed, so all of them are found to have potential discriminatory power. These variables are used to develop discrimination models in both the CANDISC and DISCRIM procedure chapters.
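As a sketch of that next step, the selected variables might be passed to PROC CANDISC and PROC DISCRIM in a VAR statement. The options actually used in those chapters are not shown in this section, so the minimal calls below, which assume the same Sashelp.Fish data set, are illustrative only.

```sas
/* Sketch: use the variables selected by PROC STEPDISC as discriminators. */
proc candisc data=sashelp.fish;
   class Species;
   var Height Length2 Length3 Width Weight Length1;
run;

proc discrim data=sashelp.fish;
   class Species;
   var Height Length2 Length3 Width Weight Length1;
run;
```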