
The STEPDISC Procedure

Example 82.1 Performing a Stepwise Discriminant Analysis

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on 50 iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica.

   title 'Fisher (1936) Iris Data';
   
   proc format;
      value specname
         1='Setosa    '
         2='Versicolor'
         3='Virginica ';
   run;
   
   data iris;
      input SepalLength SepalWidth PetalLength PetalWidth
            Species @@;
      format Species specname.;
      label SepalLength='Sepal Length in mm.'
            SepalWidth ='Sepal Width in mm.'
            PetalLength='Petal Length in mm.'
            PetalWidth ='Petal Width in mm.';
      datalines;
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   
   ... more lines ...   

   63 33 60 25 3 53 37 15 02 1
   ;

The following analysis uses stepwise selection, which is the default selection method in PROC STEPDISC.

In the PROC STEPDISC statement, the BSSCP and TSSCP options display the between-class SSCP matrix and the total-sample corrected SSCP matrix. By default, the significance level of an F test from an analysis of covariance is used as the selection criterion. The variable under consideration is the dependent variable, and the variables already chosen act as covariates. The following SAS statements produce Output 82.1.1 through Output 82.1.8:

   %let _stdvar = ;
   proc stepdisc data=iris bsscp tsscp;
      class Species;
      var SepalLength SepalWidth PetalLength PetalWidth;
   run;
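
The preceding statements rely on defaults for the selection method and for the significance levels. As a sketch, the same analysis can be written with those defaults spelled out. METHOD=, SLENTRY=, and SLSTAY= are PROC STEPDISC statement options, and the values below simply restate what Output 82.1.1 reports (STEPWISE, 0.15, and 0.15):

```sas
/* Same analysis with the default selection settings made explicit */
proc stepdisc data=iris method=stepwise slentry=0.15 slstay=0.15
              bsscp tsscp;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Changing these values changes which variables are kept. For example, with SLENTRY=0.01 the variable SepalLength, whose entry test in step 4 has Pr > F = 0.0103, would not be expected to enter.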

Output 82.1.1 Iris Data: Summary Information
Fisher (1936) Iris Data

The STEPDISC Procedure

The Method for Selecting Variables is STEPWISE
Total Sample Size   150    Variable(s) in the Analysis      4
Class Levels          3    Variable(s) Will Be Included     0
                           Significance Level to Enter   0.15
                           Significance Level to Stay    0.15

Number of Observations Read 150
Number of Observations Used 150

Class Level Information
Species     Variable Name  Frequency   Weight  Proportion
Setosa      Setosa                50  50.0000    0.333333
Versicolor  Versicolor            50  50.0000    0.333333
Virginica   Virginica             50  50.0000    0.333333

Output 82.1.2 Iris Data: Between-Class and Total-Sample SSCP Matrices
Fisher (1936) Iris Data

The STEPDISC Procedure

Between-Class SSCP Matrix
Variable     Label                SepalLength   SepalWidth  PetalLength   PetalWidth
SepalLength  Sepal Length in mm.   6321.21333  -1995.26667  16524.84000   7127.93333
SepalWidth   Sepal Width in mm.   -1995.26667   1134.49333  -5723.96000  -2293.26667
PetalLength  Petal Length in mm.  16524.84000  -5723.96000  43710.28000  18677.40000
PetalWidth   Petal Width in mm.    7127.93333  -2293.26667  18677.40000   8041.33333

Total-Sample SSCP Matrix
Variable     Label                SepalLength   SepalWidth  PetalLength   PetalWidth
SepalLength  Sepal Length in mm.  10216.83333   -632.26667  18987.30000   7692.43333
SepalWidth   Sepal Width in mm.    -632.26667   2830.69333  -4911.88000  -1812.42667
PetalLength  Petal Length in mm.  18987.30000  -4911.88000  46432.54000  19304.58000
PetalWidth   Petal Width in mm.    7692.43333  -1812.42667  19304.58000   8656.99333

In step 1, the tolerance is 1.0 for each variable under consideration because no variables have yet entered the model. The variable PetalLength is selected because its F statistic, 1180.161, is the largest among all variables.

Output 82.1.3 Iris Data: Stepwise Selection Step 1
Fisher (1936) Iris Data

The STEPDISC Procedure
Stepwise Selection: Step 1

Statistics for Entry, DF = 2, 147
Variable     Label                R-Square  F Value  Pr > F  Tolerance
SepalLength  Sepal Length in mm.    0.6187   119.26  <.0001     1.0000
SepalWidth   Sepal Width in mm.     0.4008    49.16  <.0001     1.0000
PetalLength  Petal Length in mm.    0.9414  1180.16  <.0001     1.0000
PetalWidth   Petal Width in mm.     0.9289   960.01  <.0001     1.0000

Variable PetalLength will be entered.

Variable(s) That Have Been Entered
PetalLength

Multivariate Statistics
Statistic                                 Value  F Value  Num DF  Den DF  Pr > F
Wilks' Lambda                          0.058628  1180.16       2     147  <.0001
Pillai's Trace                         0.941372  1180.16       2     147  <.0001
Average Squared Canonical Correlation  0.470686
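
The step 1 statistics for PetalLength can be checked by hand. The DATA step below is a minimal sketch that assumes the R-Square in step 1 is the ratio of the between-class to the total-sample sum of squares (the PetalLength diagonal entries of Output 82.1.2) and that the F value is the usual one-way F statistic with 2 and 147 degrees of freedom. The variable names between, total, rsq, and f are illustrative only:

```sas
data _null_;
   /* PetalLength diagonal entries copied from Output 82.1.2 */
   between = 43710.28;
   total   = 46432.54;

   /* proportion of the total sum of squares explained by Species */
   rsq = between / total;

   /* one-way F statistic: 3 classes, 150 observations */
   f = (rsq / (1 - rsq)) * (147 / 2);

   put rsq= 8.6 f= 10.2;
run;
```

Under these assumptions the log should show values close to the R-Square of 0.9414 and the F value of 1180.16 reported for PetalLength, and 1 minus the R-Square should be close to the Wilks' lambda of 0.058628.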


In step 2, with the variable PetalLength already in the model, PetalLength is tested for removal before a new variable is selected for entry. Since PetalLength meets the criterion to stay, it is used as a covariate in the analysis of covariance for variable selection. The variable SepalWidth is selected because its F statistic, 43.035, is the largest among all variables not in the model and because its associated tolerance, 0.8164, meets the criterion to enter. The process is repeated in steps 3 and 4. The variable PetalWidth is entered in step 3, and the variable SepalLength is entered in step 4.

Output 82.1.4 Iris Data: Stepwise Selection Step 2
Fisher (1936) Iris Data

The STEPDISC Procedure
Stepwise Selection: Step 2

Statistics for Removal, DF = 2, 147
Variable     Label                R-Square  F Value  Pr > F
PetalLength  Petal Length in mm.    0.9414  1180.16  <.0001

No variables can be removed.

Statistics for Entry, DF = 2, 146
Variable     Label                Partial R-Square  F Value  Pr > F  Tolerance
SepalLength  Sepal Length in mm.            0.3198    34.32  <.0001     0.2400
SepalWidth   Sepal Width in mm.             0.3709    43.04  <.0001     0.8164
PetalWidth   Petal Width in mm.             0.2533    24.77  <.0001     0.0729

Variable SepalWidth will be entered.

Variable(s) That Have Been Entered
SepalWidth PetalLength

Multivariate Statistics
Statistic                                 Value  F Value  Num DF  Den DF  Pr > F
Wilks' Lambda                          0.036884   307.10       4     292  <.0001
Pillai's Trace                         1.119908    93.53       4     294  <.0001
Average Squared Canonical Correlation  0.559954
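
The entry test in step 2 is described as an analysis of covariance with the candidate variable as the dependent variable and the variables already chosen as covariates. As a sketch of that idea, the test for SepalWidth could be set up directly in PROC GLM. This is an illustration, not part of the original example, and it assumes that the Type III test for Species is the test that corresponds to the entry statistic:

```sas
proc glm data=iris;
   class Species;
   /* PetalLength, already in the model, acts as the covariate; */
   /* SepalWidth is the candidate variable being tested         */
   model SepalWidth = PetalLength Species / ss3;
run;
quit;
```

If the correspondence holds, the Type III F test for Species has 2 and 146 degrees of freedom and should be comparable to the F value of 43.04 shown for SepalWidth in Output 82.1.4.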


Output 82.1.5 Iris Data: Stepwise Selection Step 3
Fisher (1936) Iris Data

The STEPDISC Procedure
Stepwise Selection: Step 3

Statistics for Removal, DF = 2, 146
Variable     Label                Partial R-Square  F Value  Pr > F
SepalWidth   Sepal Width in mm.             0.3709    43.04  <.0001
PetalLength  Petal Length in mm.            0.9384  1112.95  <.0001

No variables can be removed.

Statistics for Entry, DF = 2, 145
Variable     Label                Partial R-Square  F Value  Pr > F  Tolerance
SepalLength  Sepal Length in mm.            0.1447    12.27  <.0001     0.1323
PetalWidth   Petal Width in mm.             0.3229    34.57  <.0001     0.0662

Variable PetalWidth will be entered.

Variable(s) That Have Been Entered
SepalWidth PetalLength PetalWidth

Multivariate Statistics
Statistic                                 Value  F Value  Num DF  Den DF  Pr > F
Wilks' Lambda                          0.024976   257.50       6     290  <.0001
Pillai's Trace                         1.189914    71.49       6     292  <.0001
Average Squared Canonical Correlation  0.594957


Output 82.1.6 Iris Data: Stepwise Selection Step 4
Fisher (1936) Iris Data

The STEPDISC Procedure
Stepwise Selection: Step 4

Statistics for Removal, DF = 2, 145
Variable     Label                Partial R-Square  F Value  Pr > F
SepalWidth   Sepal Width in mm.             0.4295    54.58  <.0001
PetalLength  Petal Length in mm.            0.3482    38.72  <.0001
PetalWidth   Petal Width in mm.             0.3229    34.57  <.0001

No variables can be removed.

Statistics for Entry, DF = 2, 144
Variable     Label                Partial R-Square  F Value  Pr > F  Tolerance
SepalLength  Sepal Length in mm.            0.0615     4.72  0.0103     0.0320

Variable SepalLength will be entered.

All variables have been entered.

Multivariate Statistics
Statistic                                 Value  F Value  Num DF  Den DF  Pr > F
Wilks' Lambda                          0.023439   199.15       8     288  <.0001
Pillai's Trace                         1.191899    53.47       8     290  <.0001
Average Squared Canonical Correlation  0.595949

Since no more variables can be added to or removed from the model, the procedure stops at step 5 and displays a summary of the selection process.

Output 82.1.7 Iris Data: Stepwise Selection Step 5
Fisher (1936) Iris Data

The STEPDISC Procedure
Stepwise Selection: Step 5

Statistics for Removal, DF = 2, 144
Variable     Label                Partial R-Square  F Value  Pr > F
SepalLength  Sepal Length in mm.            0.0615     4.72  0.0103
SepalWidth   Sepal Width in mm.             0.2335    21.94  <.0001
PetalLength  Petal Length in mm.            0.3308    35.59  <.0001
PetalWidth   Petal Width in mm.             0.2570    24.90  <.0001

No variables can be removed.

No further steps are possible.

Output 82.1.8 Iris Data: Stepwise Selection Summary

Fisher (1936) Iris Data

The STEPDISC Procedure

Stepwise Selection Summary
Step  Number In  Entered      Removed  Label                Partial R-Square  F Value  Pr > F
   1          1  PetalLength           Petal Length in mm.            0.9414  1180.16  <.0001
   2          2  SepalWidth            Sepal Width in mm.             0.3709    43.04  <.0001
   3          3  PetalWidth            Petal Width in mm.             0.3229    34.57  <.0001
   4          4  SepalLength           Sepal Length in mm.            0.0615     4.72  0.0103

Step  Wilks' Lambda  Pr < Lambda  Average Squared Canonical Correlation  Pr > ASCC
   1     0.05862828       <.0001                             0.47068586     <.0001
   2     0.03688411       <.0001                             0.55995394     <.0001
   3     0.02497554       <.0001                             0.59495691     <.0001
   4     0.02343863       <.0001                             0.59594941     <.0001
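
The Partial R-Square and Wilks' Lambda values in the summary appear to be linked: each new Wilks' lambda is the previous one multiplied by 1 minus the partial R-Square of the variable just entered. The DATA step below is a small sketch that checks this relationship against the summary values. The relationship is inferred from the numbers shown here rather than stated in the example, and the array name prsq is illustrative only:

```sas
data _null_;
   /* Partial R-Square values for steps 2 through 4, copied from the summary */
   array prsq[3] _temporary_ (0.3709 0.3229 0.0615);

   /* Wilks' lambda after step 1, copied from the summary */
   lambda = 0.05862828;

   do step = 2 to 4;
      /* shrink lambda by the share of variation the new variable explains */
      lambda = lambda * (1 - prsq[step - 1]);
      put step= lambda= 10.6;
   end;
run;
```

Because the partial R-Square values are rounded to four digits, the computed values should agree with the reported Wilks' lambda values only to about four decimal places.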


PROC STEPDISC automatically creates a list of the selected variables and stores it in a macro variable. You can submit the following statement to see the list of selected variables:

   * print the macro variable list;
   %put &_stdvar;

The macro variable _StdVar contains the following variable list:

   SepalLength SepalWidth PetalLength PetalWidth

You could use this macro variable if you want to analyze these variables in subsequent steps as follows:

   proc discrim data=iris;
      class Species;
      var &_stdvar;
   run;

The results of this step are not shown.
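
If the selection process enters no variables, the macro variable is empty and a VAR statement built from it would list no variables. One way to guard against that case is sketched below. The macro name discrim_selected is hypothetical and not part of the original example:

```sas
%macro discrim_selected;
   /* Run PROC DISCRIM only when STEPDISC left a nonempty variable list */
   %if %length(&_stdvar) = 0 %then
      %put NOTE: STEPDISC selected no variables, so PROC DISCRIM is skipped.;
   %else %do;
      proc discrim data=iris;
         class Species;
         var &_stdvar;
      run;
   %end;
%mend discrim_selected;

%discrim_selected
```

The %LET statement that clears _StdVar before the PROC STEPDISC step means that the macro variable exists, and is empty, even when nothing is selected, which is what the %LENGTH test relies on.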
