The STEPDISC Procedure

The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmavesi; this data set is available from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and width of each fish are recorded. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. PROC STEPDISC selects a subset of the six quantitative variables that might be useful for differentiating between the fish species. This subset is then used in conjunction with PROC CANDISC and PROC DISCRIM to develop discrimination models.

The following steps create the data set fish and use PROC STEPDISC to select a subset of potential discriminator variables. By default, PROC STEPDISC uses stepwise selection on all numeric variables that are not listed in other statements, and the significance levels for a variable to enter the subset and to stay in the subset are set to 0.15. The following statements produce Figure 82.1 through Figure 82.5:

    title 'Fish Measurement Data';

    proc format;
       value specfmt
          1='Bream'
          2='Roach'
          3='Whitefish'
          4='Parkki'
          5='Perch'
          6='Pike'
          7='Smelt';
    run;

    data fish (drop=HtPct WidthPct);
       input Species Weight Length1 Length2 Length3 HtPct WidthPct @@;
       Height=HtPct*Length3/100;
       Width=WidthPct*Length3/100;
       format Species specfmt.;
       datalines;
    1 242.0 23.2 25.4 30.0 38.4 13.4   1 290.0 24.0 26.3 31.2 40.0 13.8
    1 340.0 23.9 26.5 31.1 39.8 15.1   1 363.0 26.3 29.0 33.5 38.0 13.3
    1 430.0 26.5 29.0 34.0 36.6 15.1   1 450.0 26.8 29.7 34.7 39.2 14.2

       ... more lines ...

    7 19.7 13.2 14.3 15.2 18.9 13.6    7 19.9 13.8 15.0 16.2 18.1 11.6
    ;

    proc stepdisc data=fish;
       class Species;
    run;
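Because the run above relies entirely on defaults, it can help to see them written out. The following is a minimal sketch that should request the same analysis under the defaults described in the text; it assumes the METHOD=, SLENTRY=, and SLSTAY= options of the PROC STEPDISC statement and names the six quantitative variables in a VAR statement instead of letting the procedure pick up all remaining numeric variables.

```sas
/* Sketch: the same analysis with the documented defaults spelled out. */
proc stepdisc data=fish
              method=stepwise   /* stepwise selection (the default)       */
              slentry=0.15      /* significance level to enter the subset */
              slstay=0.15;      /* significance level to stay in it       */
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Listing the variables explicitly guards against an unrelated numeric variable being added to the data set later and silently entering the analysis.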

PROC STEPDISC begins by displaying summary information about the analysis (see Figure 82.1). This information includes the number of observations with nonmissing values, the number of classes in the classification variable (specified by the CLASS statement), the number of quantitative variables under consideration, the significance criteria for variables to enter and to stay in the model, and the method of variable selection being used. The frequency of each class is also displayed.

The Method for Selecting Variables is STEPWISE | | | |
---|---|---|---|
Total Sample Size | 158 | Variable(s) in the Analysis | 6 |
Class Levels | 7 | Variable(s) Will Be Included | 0 |
Significance Level to Enter | 0.15 | | |
Significance Level to Stay | 0.15 | | |

Class Level Information

Species | Variable Name | Frequency | Weight | Proportion |
---|---|---|---|---|
Bream | Bream | 34 | 34.0000 | 0.215190 |
Parkki | Parkki | 11 | 11.0000 | 0.069620 |
Perch | Perch | 56 | 56.0000 | 0.354430 |
Pike | Pike | 17 | 17.0000 | 0.107595 |
Roach | Roach | 20 | 20.0000 | 0.126582 |
Smelt | Smelt | 14 | 14.0000 | 0.088608 |
Whitefish | Whitefish | 6 | 6.0000 | 0.037975 |
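Each proportion in the class level table is the class frequency divided by the total sample size; for example, 34/158 ≈ 0.215190 for bream. The total is 158 rather than 159, presumably because one fish has a missing value in an analysis variable and is therefore excluded. The following sketch is one way to cross-check the frequencies; it assumes that only observations with no missing values among the six quantitative variables are counted.

```sas
/* Sketch: class frequencies among complete observations only.      */
/* NMISS returns the number of missing values among its arguments.  */
proc freq data=fish;
   where nmiss(Weight, Length1, Length2, Length3, Height, Width) = 0;
   tables Species;
run;
```

PROC FREQ reports percentages rather than proportions, so its Percent column should correspond to the Proportion column multiplied by 100.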

For each entry step, the statistics for entry are displayed for all variables not currently selected (see Figure 82.2). The variable selected to enter at this step (if any) is displayed, as well as all the variables currently selected. Next are multivariate statistics that take into account all previously selected variables and the newly entered variable.

Statistics for Entry, DF = 6, 151

Variable | R-Square | F Value | Pr > F | Tolerance |
---|---|---|---|---|
Weight | 0.3750 | 15.10 | <.0001 | 1.0000 |
Length1 | 0.6017 | 38.02 | <.0001 | 1.0000 |
Length2 | 0.6098 | 39.32 | <.0001 | 1.0000 |
Length3 | 0.6280 | 42.49 | <.0001 | 1.0000 |
Height | 0.7553 | 77.69 | <.0001 | 1.0000 |
Width | 0.4806 | 23.29 | <.0001 | 1.0000 |

Multivariate Statistics

Statistic | Value | F Value | Num DF | Den DF | Pr > F |
---|---|---|---|---|---|
Wilks' Lambda | 0.244670 | 77.69 | 6 | 151 | <.0001 |
Pillai's Trace | 0.755330 | 77.69 | 6 | 151 | <.0001 |
Average Squared Canonical Correlation | 0.125888 | | | | |
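With a single variable in the model, the step 1 statistics are tied together by simple identities, which can be checked against the displayed values. The worked arithmetic below is a sketch: it uses the R-square of Height from the entry table, the degrees of freedom 6 and 151, and takes the average squared canonical correlation (ASCC) to be Pillai's trace divided by the number of classes minus one.

$$
\begin{aligned}
\Lambda &= 1 - R^2 = 1 - 0.755330 = 0.244670, \\
F &= \frac{R^2}{1 - R^2} \cdot \frac{151}{6} = \frac{0.755330}{0.244670} \cdot \frac{151}{6} \approx 77.69, \\
\text{ASCC} &= \frac{\text{Pillai's trace}}{7 - 1} = \frac{0.755330}{6} \approx 0.125888.
\end{aligned}
$$

This is why Wilks' lambda and Pillai's trace sum to 1 and share the same F value at the first step.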

For each removal step (Figure 82.3), the statistics for removal are displayed for all variables currently entered. The variable to be removed at this step (if any) is displayed. If no variable meets the criterion to be removed and the maximum number of steps as specified by the MAXSTEP= option has not been attained, then the procedure continues with another entry step.

Statistics for Removal, DF = 6, 151

Variable | R-Square | F Value | Pr > F |
---|---|---|---|
Height | 0.7553 | 77.69 | <.0001 |

Statistics for Entry, DF = 6, 150

Variable | Partial R-Square | F Value | Pr > F | Tolerance |
---|---|---|---|---|
Weight | 0.7388 | 70.71 | <.0001 | 0.4690 |
Length1 | 0.9220 | 295.35 | <.0001 | 0.6083 |
Length2 | 0.9229 | 299.31 | <.0001 | 0.5892 |
Length3 | 0.9173 | 277.37 | <.0001 | 0.5056 |
Width | 0.8783 | 180.44 | <.0001 | 0.3699 |

Multivariate Statistics

Statistic | Value | F Value | Num DF | Den DF | Pr > F |
---|---|---|---|---|---|
Wilks' Lambda | 0.018861 | 157.04 | 12 | 300 | <.0001 |
Pillai's Trace | 1.554349 | 87.78 | 12 | 302 | <.0001 |
Average Squared Canonical Correlation | 0.259058 | | | | |

The stepwise procedure terminates either when no variable can be removed and no variable can be entered or when the maximum number of steps as specified by the MAXSTEP= option has been attained. In this example, at step 7 no variable can be either removed or entered (Figure 82.4). Steps 3 through 6 are not displayed in this document.

Statistics for Removal, DF = 6, 146

Variable | Partial R-Square | F Value | Pr > F |
---|---|---|---|
Weight | 0.4521 | 20.08 | <.0001 |
Length1 | 0.2987 | 10.36 | <.0001 |
Length2 | 0.5250 | 26.89 | <.0001 |
Length3 | 0.7948 | 94.25 | <.0001 |
Height | 0.7257 | 64.37 | <.0001 |
Width | 0.5757 | 33.02 | <.0001 |
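The termination rule described above can also be triggered deliberately by capping the number of steps. The following sketch assumes the MAXSTEP= option of the PROC STEPDISC statement mentioned in the text; the value 2 is purely illustrative and is not taken from the original example.

```sas
/* Sketch: stop after at most two steps, even if more variables */
/* would still meet the entry criterion.                        */
proc stepdisc data=fish maxstep=2;
   class Species;
run;
```

A small cap like this can be useful for exploring which variables enter first without producing the full step-by-step output.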

PROC STEPDISC ends by displaying a summary of the steps (Figure 82.5).

Stepwise Selection Summary

Step | Number In | Entered | Removed | Partial R-Square | F Value | Pr > F | Wilks' Lambda | Pr < Lambda | Average Squared Canonical Correlation | Pr > ASCC |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Height | | 0.7553 | 77.69 | <.0001 | 0.24466983 | <.0001 | 0.12588836 | <.0001 |
2 | 2 | Length2 | | 0.9229 | 299.31 | <.0001 | 0.01886065 | <.0001 | 0.25905822 | <.0001 |
3 | 3 | Length3 | | 0.8826 | 186.77 | <.0001 | 0.00221342 | <.0001 | 0.38427100 | <.0001 |
4 | 4 | Width | | 0.5775 | 33.72 | <.0001 | 0.00093510 | <.0001 | 0.45200732 | <.0001 |
5 | 5 | Weight | | 0.4461 | 19.73 | <.0001 | 0.00051794 | <.0001 | 0.49488458 | <.0001 |
6 | 6 | Length1 | | 0.2987 | 10.36 | <.0001 | 0.00036325 | <.0001 | 0.51744189 | <.0001 |
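The summary columns are linked from one step to the next: when a variable enters, Wilks' lambda is multiplied by one minus that variable's partial R-square. The check below is a sketch using the rounded values from the summary, so agreement holds only up to rounding of the partial R-square; the subscript k denotes the step number and is notation introduced here.

$$
\begin{aligned}
\Lambda_k &= \Lambda_{k-1}\,\bigl(1 - R^2_{\text{partial},\,k}\bigr), \\
\Lambda_2 &\approx 0.24466983 \times (1 - 0.9229) \approx 0.0189, \\
\Lambda_3 &\approx 0.01886065 \times (1 - 0.8826) \approx 0.0022.
\end{aligned}
$$

Both results match the reported values 0.01886065 and 0.00221342 to the precision that the four-digit partial R-square allows.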

All the variables in the data set are found to have potential discriminatory power. These variables are used to develop discrimination models in both the CANDISC and DISCRIM procedure chapters.
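As a rough illustration of that next stage, the selected variables can be passed to the follow-up procedures in a VAR statement. This is a minimal sketch with default options only; the CANDISC and DISCRIM chapters may use additional options that are not shown here.

```sas
/* Sketch: canonical discriminant analysis on the selected variables. */
proc candisc data=fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

/* Sketch: discriminant analysis with the same class and variables. */
proc discrim data=fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Had PROC STEPDISC dropped any variable, the VAR lists would be shortened to match the final subset.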

Copyright © 2009 by SAS Institute Inc., Cary, NC, USA. All rights reserved.