Example 12.5 Stackloss Data

The following example analyzes the three regressors of Brownlee (1965) stackloss data. By default, the MVE subroutine tries only 2000 randomly selected subsets in its search. There are, in total, 5985 subsets of 4 cases out of 21 cases. Here is the code:

title2 "***MVE for Stackloss Data***";
title3 "*** Use All Subsets***";
aa = { 1  80  27  89  42,
     1  80  27  88  37,
     1  75  25  90  37,
     1  62  24  87  28,
     1  62  22  87  18,
     1  62  23  87  18,
     1  62  24  93  19,
     1  62  24  93  20,
     1  58  23  87  15,
     1  58  18  80  14,
     1  58  18  89  14,
     1  58  17  88  13,
     1  58  18  82  11,
     1  58  19  93  12,
     1  50  18  89   8,
     1  50  18  86   7,
     1  50  19  72   8,
     1  50  19  79   8,
     1  50  20  80   9,
     1  56  20  82  15,
     1  70  20  91  15 };
a = aa[,2:4];
optn = j(9,1,.);
optn[1]= 2;              /* ipri */
optn[2]= 1;              /* pcov: print COV */
optn[3]= 1;              /* pcor: print CORR */
optn[5]= -1;             /* nrep: use all subsets */

call mve(sc,xmve,dist,optn,a);

Output 12.5.1 of the output shows the classical scatter and correlation matrix.

Output 12.5.1: Some Simple Statistics

*** Brainlog Data: Do MCD, MVE ***
***MVE for Stackloss Data***
*** Use All Subsets***

Classical Covariance Matrix
  VAR1 VAR2 VAR3
VAR1 84.057142857 22.657142857 24.571428571
VAR2 22.657142857 9.9904761905 6.6214285714
VAR3 24.571428571 6.6214285714 28.714285714

Classical Correlation Matrix
  VAR1 VAR2 VAR3
VAR1 1 0.781852333 0.5001428749
VAR2 0.781852333 1 0.3909395378
VAR3 0.5001428749 0.3909395378 1

Classical Mean
VAR1 60.428571429
VAR2 21.095238095
VAR3 86.285714286


Output 12.5.2 shows the results of the optimization (complete subset sampling).

Output 12.5.2: Iteration History

Subset Singular Best
Criterion
Percent
1497 22 253.312431 25
2993 46 224.084073 50
4489 77 165.830053 75
5985 156 165.634363 100

Observations of Best Subset
7 10 14 20

Initial MVE Location
Estimates
VAR1 58.5
VAR2 20.25
VAR3 87

Initial MVE Scatter Matrix
  VAR1 VAR2 VAR3
VAR1 34.829014749 28.413143611 62.32560534
VAR2 28.413143611 38.036950318 58.659393261
VAR3 62.32560534 58.659393261 267.63348175


Output 12.5.3 shows the optimization results after local improvement.

Output 12.5.3: Table of MVE Results

Robust MVE Location Estimates
VAR1 56.705882353
VAR2 20.235294118
VAR3 85.529411765

Robust MVE Scatter Matrix
  VAR1 VAR2 VAR3
VAR1 23.470588235 7.5735294118 16.102941176
VAR2 7.5735294118 6.3161764706 5.3676470588
VAR3 16.102941176 5.3676470588 32.389705882

Eigenvalues of Robust
Scatter Matrix
VAR1 46.597431018
VAR2 12.155938483
VAR3 3.423101087

Robust Correlation Matrix
  VAR1 VAR2 VAR3
VAR1 1 0.6220269501 0.5840361335
VAR2 0.6220269501 1 0.375278187
VAR3 0.5840361335 0.375278187 1


Output 12.5.4 presents a table that contains the classical Mahalanobis distances, the robust distances, and the weights identifying the outlying observations (that is, the leverage points when explaining $y$ with these three regressor variables).

Output 12.5.4: Mahalanobis and Robust Distances

Classical Distances and Robust (Rousseeuw) Distances
Unsquared Mahalanobis Distance and
Unsquared Rousseeuw Distance of Each Observation
N Mahalanobis Distances Robust Distances Weight
1 2.253603 5.528395 0
2 2.324745 5.637357 0
3 1.593712 4.197235 0
4 1.271898 1.588734 1.000000
5 0.303357 1.189335 1.000000
6 0.772895 1.308038 1.000000
7 1.852661 1.715924 1.000000
8 1.852661 1.715924 1.000000
9 1.360622 1.226680 1.000000
10 1.745997 1.936256 1.000000
11 1.465702 1.493509 1.000000
12 1.841504 1.913079 1.000000
13 1.482649 1.659943 1.000000
14 1.778785 1.689210 1.000000
15 1.690241 2.230109 1.000000
16 1.291934 1.767582 1.000000
17 2.700016 2.431021 1.000000
18 1.503155 1.523316 1.000000
19 1.593221 1.710165 1.000000
20 0.807054 0.675124 1.000000
21 2.176761 3.657281 0


The following specification generates three bivariate plots of the classical and robust tolerance ellipsoids. They are shown in Output 12.5.5, Output 12.5.6, and Output 12.5.7, one plot for each pair of variables.

optn = j(9,1,.); optn[5]= -1;
vnam = { "Rate", "Temperature", "AcidConcent" };
filn = "stlmve";
titl = "Stackloss Data: Use All Subsets";
%include 'robustmc.sas';
call scatmve(2,optn,.9,a,vnam,titl,1,filn);

Output 12.5.5: Stackloss Data: Rate vs. Temperature (MVE)

Stackloss Data: Rate vs. Temperature (MVE)


Output 12.5.6: Stackloss Data: Rate vs. Acid Concentration (MVE)

Stackloss Data: Rate vs. Acid Concentration (MVE)


Output 12.5.7: Stackloss Data: Temperature vs. Acid Concentration (MVE)

Stackloss Data: Temperature vs. Acid Concentration (MVE)


You can also use the MCD method for the stackloss data as follows:

title2 "***MCD for Stackloss Data***";
title3 "*** Use 500 Random Subsets***";
a = aa[,2:4];
optn = j(8,1,.);
optn[1]= 2;              /* ipri */
optn[2]= 1;              /* pcov: print COV */
optn[3]= 1;              /* pcor: print CORR */
optn[5]= -1 ;            /* nrep: use all subsets */

call mcd(sc,xmcd,dist,optn,a);

The optimization results are displayed in Output 12.5.8. The reweighted results are displayed in Output 12.5.9.

Output 12.5.8: MCD Results of Optimization

*** Brainlog Data: Do MCD, MVE ***
***MCD for Stackloss Data***
*** Use 500 Random Subsets***

4 5 6 7 8 9 10 11 12 13 14 20

MCD Location Estimate
VAR1 VAR2 VAR3
59.5 20.833333333 87.333333333

MCD Scatter Matrix Estimate
  VAR1 VAR2 VAR3
VAR1 5.1818181818 4.8181818182 4.7272727273
VAR2 4.8181818182 7.6060606061 5.0606060606
VAR3 4.7272727273 5.0606060606 19.151515152

MCD Correlation Matrix
  VAR1 VAR2 VAR3
VAR1 1 0.7674714142 0.4745347313
VAR2 0.7674714142 1 0.4192963398
VAR3 0.4745347313 0.4192963398 1

Consistent Scatter Matrix
  VAR1 VAR2 VAR3
VAR1 8.6578437815 8.0502757968 7.8983838007
VAR2 8.0502757968 12.708297013 8.4553211199
VAR3 7.8983838007 8.4553211199 31.998580526


Output 12.5.9: Final Reweighted MCD Results

Reweighted Location Estimate
VAR1 VAR2 VAR3
59.5 20.833333333 87.333333333

Reweighted Scatter Matrix
  VAR1 VAR2 VAR3
VAR1 5.1818181818 4.8181818182 4.7272727273
VAR2 4.8181818182 7.6060606061 5.0606060606
VAR3 4.7272727273 5.0606060606 19.151515152

Eigenvalues
VAR1 VAR2 VAR3
23.191069268 7.3520037086 1.3963209628

Reweighted Correlation Matrix
  VAR1 VAR2 VAR3
VAR1 1 0.7674714142 0.4745347313
VAR2 0.7674714142 1 0.4192963398
VAR3 0.4745347313 0.4192963398 1


The MCD robust distances and outlying diagnostic are displayed in Output 12.5.10. MCD identifies more leverage points than MVE.

Output 12.5.10: MCD Robust Distances

Classical Distances and Robust (Rousseeuw) Distances
Unsquared Mahalanobis Distance and
Unsquared Rousseeuw Distance of Each Observation
N Mahalanobis Distances Robust Distances Weight
1 2.253603 12.173282 0
2 2.324745 12.255677 0
3 1.593712 9.263990 0
4 1.271898 1.401368 1.000000
5 0.303357 1.420020 1.000000
6 0.772895 1.291188 1.000000
7 1.852661 1.460370 1.000000
8 1.852661 1.460370 1.000000
9 1.360622 2.120590 1.000000
10 1.745997 1.809708 1.000000
11 1.465702 1.362278 1.000000
12 1.841504 1.667437 1.000000
13 1.482649 1.416724 1.000000
14 1.778785 1.988240 1.000000
15 1.690241 5.874858 0
16 1.291934 5.606157 0
17 2.700016 6.133319 0
18 1.503155 5.760432 0
19 1.593221 6.156248 0
20 0.807054 2.172300 1.000000
21 2.176761 7.622769 0


Similarly, you can use the SCATMCD routine to generate three bivariate plots of the classical and robust tolerance ellipsoids, one plot for each pair of variables. Here is the code:

optn = j(9,1,.); optn[5]= -1;
vnam = { "Rate", "Temperature", "AcidConcent" };
filn = "stlmcd";
titl = "Stackloss Data: Use All Subsets";
%include 'robustmc.sas';
call scatmcd(2,optn,.9,a,vnam,titl,1,filn);

Output 12.5.11, Output 12.5.12, and Output 12.5.13 display these plots.

Output 12.5.11: Stackloss Data: Rate vs. Temperature (MCD)

Stackloss Data: Rate vs. Temperature (MCD)


Output 12.5.12: Stackloss Data: Rate vs. Acid Concentration (MCD)

Stackloss Data: Rate vs. Acid Concentration (MCD)


Output 12.5.13: Stackloss Data: Temperature vs. Acid Concentration (MCD)

Stackloss Data: Temperature vs. Acid Concentration (MCD)