The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on 50 iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica. The iris data set is available from the Sashelp library.
This example performs a stepwise discriminant analysis by using the default selection method, METHOD=STEPWISE.
In the PROC STEPDISC statement, the BSSCP and TSSCP options display the between-class SSCP matrix and the total-sample corrected SSCP matrix. By default, the significance level of an F test from an analysis of covariance is used as the selection criterion. The variable under consideration is the dependent variable, and the variables already chosen act as covariates. The following SAS statements produce Output 96.1.1 through Output 96.1.8:
    title 'Fisher (1936) Iris Data';

    %let _stdvar = ;
    proc stepdisc data=sashelp.iris bsscp tsscp;
       class Species;
       var SepalLength SepalWidth PetalLength PetalWidth;
    run;
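The default selection criterion can also be written out explicitly. The following is a minimal sketch, assuming the defaults METHOD=STEPWISE, SLENTRY=0.15, and SLSTAY=0.15; under that assumption it should select the same variables as the preceding statements, without the SSCP displays:

```sas
/* Sketch: the stepwise analysis with the selection settings spelled out. */
/* Assumption: 0.15 is the default entry and stay significance level.     */
proc stepdisc data=sashelp.iris
              method=stepwise   /* stepwise entry and removal             */
              slentry=0.15      /* significance level required to enter   */
              slstay=0.15;      /* significance level required to stay    */
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Lowering SLENTRY= or SLSTAY= makes the selection more conservative; with these data, the step 4 entry of SepalLength (Pr > F = 0.0103) is the only decision close enough to be affected by a moderately stricter level.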
Output 96.1.2: Iris Data: Between-Class and Total-Sample SSCP Matrices
Fisher (1936) Iris Data |
Between-Class SSCP Matrix | |||||
---|---|---|---|---|---|
Variable | Label | SepalLength | SepalWidth | PetalLength | PetalWidth |
SepalLength | Sepal Length (mm) | 6321.21333 | -1995.26667 | 16524.84000 | 7127.93333 |
SepalWidth | Sepal Width (mm) | -1995.26667 | 1134.49333 | -5723.96000 | -2293.26667 |
PetalLength | Petal Length (mm) | 16524.84000 | -5723.96000 | 43710.28000 | 18677.40000 |
PetalWidth | Petal Width (mm) | 7127.93333 | -2293.26667 | 18677.40000 | 8041.33333 |
Total-Sample SSCP Matrix | |||||
---|---|---|---|---|---|
Variable | Label | SepalLength | SepalWidth | PetalLength | PetalWidth |
SepalLength | Sepal Length (mm) | 10216.83333 | -632.26667 | 18987.30000 | 7692.43333 |
SepalWidth | Sepal Width (mm) | -632.26667 | 2830.69333 | -4911.88000 | -1812.42667 |
PetalLength | Petal Length (mm) | 18987.30000 | -4911.88000 | 46432.54000 | 19304.58000 |
PetalWidth | Petal Width (mm) | 7692.43333 | -1812.42667 | 19304.58000 | 8656.99333 |
In step 1, the tolerance is 1.0 for each variable under consideration because no variables have yet entered the model. The variable PetalLength is selected because its F statistic, 1180.161, is the largest among all variables.
Output 96.1.3: Iris Data: Stepwise Selection Step 1
Fisher (1936) Iris Data |
Statistics for Entry, DF = 2, 147 | |||||
---|---|---|---|---|---|
Variable | Label | R-Square | F Value | Pr > F | Tolerance |
SepalLength | Sepal Length (mm) | 0.6187 | 119.26 | <.0001 | 1.0000 |
SepalWidth | Sepal Width (mm) | 0.4008 | 49.16 | <.0001 | 1.0000 |
PetalLength | Petal Length (mm) | 0.9414 | 1180.16 | <.0001 | 1.0000 |
PetalWidth | Petal Width (mm) | 0.9289 | 960.01 | <.0001 | 1.0000 |
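Because no covariates are in the model at step 1, the analysis of covariance for each candidate variable reduces to a one-way analysis of variance on Species. As an illustrative check (PROC GLM is not part of the original example), the following statements fit that model for PetalLength. The F test for Species has 2 and 147 degrees of freedom and should agree with the PetalLength row of the entry statistics:

```sas
/* Sketch: one-way ANOVA of PetalLength on Species.                    */
/* With no covariates, this is the step 1 entry test for PetalLength.  */
proc glm data=sashelp.iris;
   class Species;
   model PetalLength = Species;
run;
quit;
```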
In step 2, with the variable PetalLength already in the model, PetalLength is tested for removal before a new variable is selected for entry. Since PetalLength meets the criterion to stay, it is used as a covariate in the analysis of covariance for variable selection. The variable SepalWidth is selected because its F statistic, 43.035, is the largest among all variables not in the model and because its associated tolerance, 0.8164, meets the criterion to enter. The process is repeated in steps 3 and 4. The variable PetalWidth is entered in step 3, and the variable SepalLength is entered in step 4.
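The step 2 entry test can be sketched as an ordinary analysis of covariance. The following statements are an illustration only (PROC GLM is not part of the original example) and assume a common PetalLength slope across species. The candidate variable SepalWidth is the dependent variable, PetalLength is the covariate, and the Type III F test for Species, with 2 and 146 degrees of freedom, is the analog of the entry statistic that PROC STEPDISC reports for SepalWidth:

```sas
/* Sketch: ANCOVA for the step 2 candidate SepalWidth.           */
/* PetalLength (already selected) acts as the covariate.         */
/* Assumption: no Species-by-PetalLength interaction is fitted.  */
proc glm data=sashelp.iris;
   class Species;
   model SepalWidth = PetalLength Species / ss3;
run;
quit;
```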
Output 96.1.6: Iris Data: Stepwise Selection Step 4
Fisher (1936) Iris Data |
Statistics for Removal, DF = 2, 145 | ||||
---|---|---|---|---|
Variable | Label | Partial R-Square | F Value | Pr > F |
SepalWidth | Sepal Width (mm) | 0.4295 | 54.58 | <.0001 |
PetalLength | Petal Length (mm) | 0.3482 | 38.72 | <.0001 |
PetalWidth | Petal Width (mm) | 0.3229 | 34.57 | <.0001 |
Since no more variables can be added to or removed from the model, the procedure stops at step 5 and displays a summary of the selection process.
Output 96.1.7: Iris Data: Stepwise Selection Step 5
Fisher (1936) Iris Data |
Statistics for Removal, DF = 2, 144 | ||||
---|---|---|---|---|
Variable | Label | Partial R-Square | F Value | Pr > F |
SepalLength | Sepal Length (mm) | 0.0615 | 4.72 | 0.0103 |
SepalWidth | Sepal Width (mm) | 0.2335 | 21.94 | <.0001 |
PetalLength | Petal Length (mm) | 0.3308 | 35.59 | <.0001 |
PetalWidth | Petal Width (mm) | 0.2570 | 24.90 | <.0001 |
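The F values in the removal statistics can be recovered from the partial R-square values. As a worked sketch, assuming the usual relation between a partial R-square and its F statistic, with numerator and denominator degrees of freedom written here as \nu_1 = 2 and \nu_2 = 144 (this notation is introduced for illustration and does not appear in the output):

```latex
\[
F = \frac{R^2_{\text{partial}}}{1 - R^2_{\text{partial}}} \cdot \frac{\nu_2}{\nu_1},
\qquad
F_{\text{SepalLength}} \approx \frac{0.0615}{1 - 0.0615} \cdot \frac{144}{2} \approx 4.72 .
\]
```

The same arithmetic applied to the other rows gives the displayed F values up to rounding of the partial R-square.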
Output 96.1.8: Iris Data: Stepwise Selection Summary
Fisher (1936) Iris Data |
Stepwise Selection Summary | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Step | Number In | Entered | Removed | Label | Partial R-Square | F Value | Pr > F | Wilks' Lambda | Pr < Lambda | Average Squared Canonical Correlation | Pr > ASCC |
1 | 1 | PetalLength | | Petal Length (mm) | 0.9414 | 1180.16 | <.0001 | 0.05862828 | <.0001 | 0.47068586 | <.0001 |
2 | 2 | SepalWidth | | Sepal Width (mm) | 0.3709 | 43.04 | <.0001 | 0.03688411 | <.0001 | 0.55995394 | <.0001 |
3 | 3 | PetalWidth | | Petal Width (mm) | 0.3229 | 34.57 | <.0001 | 0.02497554 | <.0001 | 0.59495691 | <.0001 |
4 | 4 | SepalLength | | Sepal Length (mm) | 0.0615 | 4.72 | 0.0103 | 0.02343863 | <.0001 | 0.59594941 | <.0001 |
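The Wilks' lambda column of the summary is linked to the partial R-square column: each time a variable enters, the previous lambda is multiplied by one minus that variable's partial R-square. The following DATA step is a small sketch of that bookkeeping, using the rounded partial R-square values copied from the summary table, so the results agree with the displayed lambdas only to about three significant digits. The variable names are illustrative.

```sas
/* Sketch: rebuild Wilks' lambda step by step from the partial R-squares. */
/* Inputs are the rounded values printed in the selection summary.        */
data _null_;
   lambda = 1;                     /* lambda before any variable enters   */
   step = 0;
   do partial_r2 = 0.9414, 0.3709, 0.3229, 0.0615;
      step + 1;
      lambda = lambda * (1 - partial_r2);
      put step= partial_r2= lambda= 10.8;
   end;
run;
```

The values written to the log can be compared with the Wilks' Lambda column for steps 1 through 4.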
PROC STEPDISC automatically creates a list of the selected variables and stores it in a macro variable. You can submit the following statement to see the list of selected variables:
    * print the macro variable list;
    %put &_stdvar;
The macro variable _StdVar contains the following variable list:
SepalLength SepalWidth PetalLength PetalWidth
You could use this macro variable if you want to analyze these variables in subsequent steps as follows:
    proc discrim data=sashelp.iris;
       class Species;
       var &_stdvar;
    run;
The results of this step are not shown.