Getting Started: STEPDISC Procedure

The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmavesi; this data set is available from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and width of each fish are recorded. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The fish data set is available from the Sashelp library. PROC STEPDISC selects a subset of the six quantitative variables that might be useful for differentiating between the fish species. This subset is used in conjunction with PROC CANDISC and PROC DISCRIM to develop discrimination models.

The following steps use PROC STEPDISC to select a subset of potential discriminator variables. By default, PROC STEPDISC uses stepwise selection on all numeric variables that are not listed in other statements, and the significance levels for a variable to enter the subset and to stay in the subset are set to 0.15. The following statements produce Figure 85.1 through Figure 85.5:

title 'Fish Measurement Data';

proc stepdisc data=sashelp.fish;
   class Species;
run;
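
The defaults described above can also be written out explicitly, which makes the analysis easier to read and to modify later. The following is a minimal sketch rather than part of the original example; it assumes that the six quantitative variables in Sashelp.Fish are named as they appear in the output below, and it should be equivalent to the shorter step:

```sas
/* Same analysis with the default settings spelled out */
proc stepdisc data=sashelp.fish
              method=stepwise   /* default selection method          */
              slentry=0.15      /* significance level to enter       */
              slstay=0.15;      /* significance level to stay        */
   class Species;
   /* Candidate discriminators; omitting VAR uses all numeric variables */
   var Weight Length1 Length2 Length3 Height Width;
run;
```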

PROC STEPDISC begins by displaying summary information about the analysis (see Figure 85.1). This information includes the number of observations with nonmissing values, the number of classes in the classification variable (specified by the CLASS statement), the number of quantitative variables under consideration, the significance criteria for variables to enter and to stay in the model, and the method of variable selection being used. The frequency of each class is also displayed.

Figure 85.1 Summary Information
Fish Measurement Data

The STEPDISC Procedure

The Method for Selecting Variables is STEPWISE
Total Sample Size 158 Variable(s) in the Analysis 6
Class Levels 7 Variable(s) Will Be Included 0
    Significance Level to Enter 0.15
    Significance Level to Stay 0.15

Number of Observations Read 159
Number of Observations Used 158

Class Level Information
Species    Variable Name  Frequency  Weight   Proportion
Bream      Bream          34         34.0000  0.215190
Parkki     Parkki         11         11.0000  0.069620
Perch      Perch          56         56.0000  0.354430
Pike       Pike           17         17.0000  0.107595
Roach      Roach          20         20.0000  0.126582
Smelt      Smelt          14         14.0000  0.088608
Whitefish  Whitefish      6          6.0000   0.037975

For each entry step, the statistics for entry are displayed for all variables not currently selected (see Figure 85.2). The variable selected to enter at this step (if any) is displayed, as well as all the variables currently selected. Next are multivariate statistics that take into account all previously selected variables and the newly entered variable.

Figure 85.2 Step 1: Variable HEIGHT Selected for Entry
Fish Measurement Data

The STEPDISC Procedure
Stepwise Selection: Step 1

Statistics for Entry, DF = 6, 151
Variable R-Square F Value Pr > F Tolerance
Weight 0.3750 15.10 <.0001 1.0000
Length1 0.6017 38.02 <.0001 1.0000
Length2 0.6098 39.32 <.0001 1.0000
Length3 0.6280 42.49 <.0001 1.0000
Height 0.7553 77.69 <.0001 1.0000
Width 0.4806 23.29 <.0001 1.0000

Variable Height will be entered.

Variable(s) That Have Been Entered
Height

Multivariate Statistics
Statistic Value F Value Num DF Den DF Pr > F
Wilks' Lambda 0.244670 77.69 6 151 <.0001
Pillai's Trace 0.755330 77.69 6 151 <.0001
Average Squared Canonical Correlation 0.125888        
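
The step 1 statistics can be checked by hand. The following worked calculation is a sketch that assumes the usual one-way analysis of variance relations; the symbols $g$ (number of classes, 7), $n$ (observations used, 158), and $V$ (Pillai's trace) are notation introduced here for illustration. With a single variable in the model, Wilks' lambda is $1 - R^2$ and Pillai's trace is $R^2$:

```latex
\begin{align*}
F &= \frac{R^2/(g-1)}{(1-R^2)/(n-g)}
   = \frac{0.755330/6}{0.244670/151} \approx 77.69 \\
\Lambda &= 1 - R^2 = 1 - 0.755330 = 0.244670 \\
V &= R^2 = 0.755330 \\
\mathrm{ASCC} &= \frac{V}{g-1} = \frac{0.755330}{6} \approx 0.125888
\end{align*}
```

These values agree with the Height row of the entry statistics and with the multivariate statistics in Figure 85.2.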

For each removal step (Figure 85.3), the statistics for removal are displayed for all variables currently entered. The variable to be removed at this step (if any) is displayed. If no variable meets the criterion to be removed and the maximum number of steps as specified by the MAXSTEP= option has not been attained, then the procedure continues with another entry step.

Figure 85.3 Step 2: No Variable Is Removed; Variable Length2 Added
Fish Measurement Data

The STEPDISC Procedure
Stepwise Selection: Step 2

Statistics for Removal, DF = 6, 151
Variable R-Square F Value Pr > F
Height 0.7553 77.69 <.0001

No variables can be removed.

Statistics for Entry, DF = 6, 150
Variable  Partial R-Square  F Value  Pr > F  Tolerance
Weight    0.7388            70.71    <.0001  0.4690
Length1   0.9220            295.35   <.0001  0.6083
Length2   0.9229            299.31   <.0001  0.5892
Length3   0.9173            277.37   <.0001  0.5056
Width     0.8783            180.44   <.0001  0.3699

Variable Length2 will be entered.

Variable(s) That Have Been Entered
Length2 Height

Multivariate Statistics
Statistic Value F Value Num DF Den DF Pr > F
Wilks' Lambda 0.018861 157.04 12 300 <.0001
Pillai's Trace 1.554349 87.78 12 302 <.0001
Average Squared Canonical Correlation 0.259058        
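
The step 2 values follow the same pattern, now conditioned on the variable already in the model. This is a sketch under the same assumptions as before, with $p = 1$ denoting the number of variables already entered and $R_p^2$ the partial R-square for Length2 (both symbols are introduced here for illustration). Small differences from the printed output come from using the four-digit rounded partial R-square:

```latex
\begin{align*}
F &= \frac{R_p^2/(g-1)}{(1-R_p^2)/(n-g-p)}
   = \frac{0.9229/6}{0.0771/150} \approx 299.3 \\
\Lambda_2 &= \Lambda_1\,(1 - R_p^2)
   = 0.244670 \times 0.0771 \approx 0.01886 \\
\mathrm{ASCC} &= \frac{V}{g-1} = \frac{1.554349}{6} \approx 0.259058
\end{align*}
```

The denominator degrees of freedom, $n - g - p = 158 - 7 - 1 = 150$, match the heading of the entry statistics in Figure 85.3.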

The stepwise procedure terminates either when no variable can be removed and no variable can be entered or when the maximum number of steps as specified by the MAXSTEP= option has been attained. In this example, no variables can be either removed or entered at step 7 (Figure 85.4). Steps 3 through 6 are not displayed in this document.
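
To see the second termination condition in action, the MAXSTEP= option can be specified in the PROC STEPDISC statement. The following is a minimal sketch, not part of the original example; the value 2 is chosen only for illustration, and the selection would be expected to stop after the step shown in Figure 85.3:

```sas
/* Limit the stepwise selection to at most two steps */
proc stepdisc data=sashelp.fish maxstep=2;
   class Species;
run;
```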

Figure 85.4 Step 7: No Variables Entered or Removed
Fish Measurement Data

The STEPDISC Procedure
Stepwise Selection: Step 7

Statistics for Removal, DF = 6, 146
Variable  Partial R-Square  F Value  Pr > F
Weight    0.4521            20.08    <.0001
Length1   0.2987            10.36    <.0001
Length2   0.5250            26.89    <.0001
Length3   0.7948            94.25    <.0001
Height    0.7257            64.37    <.0001
Width     0.5757            33.02    <.0001

No variables can be removed.

No further steps are possible.

PROC STEPDISC ends by displaying a summary of the steps.

Figure 85.5 Step Summary

Fish Measurement Data

The STEPDISC Procedure

Stepwise Selection Summary
      Number                    Partial                    Wilks'      Pr <    Average Squared        Pr >
Step  In      Entered  Removed  R-Square  F Value  Pr > F  Lambda      Lambda  Canonical Correlation  ASCC
1     1       Height            0.7553    77.69    <.0001  0.24466983  <.0001  0.12588836             <.0001
2     2       Length2           0.9229    299.31   <.0001  0.01886065  <.0001  0.25905822             <.0001
3     3       Length3           0.8826    186.77   <.0001  0.00221342  <.0001  0.38427100             <.0001
4     4       Width             0.5775    33.72    <.0001  0.00093510  <.0001  0.45200732             <.0001
5     5       Weight            0.4461    19.73    <.0001  0.00051794  <.0001  0.49488458             <.0001
6     6       Length1           0.2987    10.36    <.0001  0.00036325  <.0001  0.51744189             <.0001

All six quantitative variables in the data set are found to have potential discriminatory power. These variables are used to develop discrimination models in both the CANDISC and DISCRIM procedure chapters.
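
As a rough illustration of that follow-up, the selected variables can be passed to the other two procedures through a VAR statement. The following steps are only a sketch that uses default options; they are not the code from the CANDISC or DISCRIM chapters, which might specify additional options:

```sas
/* Canonical discriminant analysis on the selected variables */
proc candisc data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

/* Discriminant analysis (default settings) on the same variables */
proc discrim data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```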