Robust Regression Examples


Example 12.5 Multivariate Location, Scale, and Outliers

This section analyzes the three regressors in the stack loss data of Brownlee (1965), which are defined in Example 12.2. As in the previous section, the LOAD MODULES=_ALL_ statement loads modules that are defined in the RobustMC.sas file.

%include "C:\<path>\RobustMC.sas";
proc iml;
load module=_all_;

By default, the MVE subroutine generates 2,000 randomly selected subsets in its search. The following call to the MVE subroutine uses all 5,985 subsets of four observations that can be chosen from the 21 observations:

   /* Obs X1  X2  X3   Y  Stack Loss data */
SL = { 1  80  27  89  42,
       2  80  27  88  37,
       3  75  25  90  37,
       4  62  24  87  28,
       5  62  22  87  18,
       6  62  23  87  18,
       7  62  24  93  19,
       8  62  24  93  20,
       9  58  23  87  15,
      10  58  18  80  14,
      11  58  18  89  14,
      12  58  17  88  13,
      13  58  18  82  11,
      14  58  19  93  12,
      15  50  18  89   8,
      16  50  18  86   7,
      17  50  19  72   8,
      18  50  19  79   8,
      19  50  20  80   9,
      20  56  20  82  15,
      21  70  20  91  15 };
x = SL[, 2:4]; y = SL[, 5];

optn = j(5,1,.);
optn[1] = 1;             /* print basic output */
optn[2] = 1;             /* print covariance matrices */
optn[5]= -1;             /* nrep: use all subsets */

ods select ClassicalMean ClassicalCov RobustMVELoc RobustMVEScatter;
call mve(sc, xmve, dist, optn, x);
ods select all;

Output 12.5.1 shows the classical and robust estimates of the location. Output 12.5.2 shows the classical and robust estimates of the scatter.

Output 12.5.1: Classical and Robust Estimates of the Location

Classical Mean
VAR1 60.428571429
VAR2 21.095238095
VAR3 86.285714286

Robust MVE Location Estimates
VAR1 56.705882353
VAR2 20.235294118
VAR3 85.529411765



Output 12.5.2: Classical and Robust Estimates of the Scatter

Classical Covariance Matrix
  VAR1 VAR2 VAR3
VAR1 84.057142857 22.657142857 24.571428571
VAR2 22.657142857 9.9904761905 6.6214285714
VAR3 24.571428571 6.6214285714 28.714285714

Robust MVE Scatter Matrix
  VAR1 VAR2 VAR3
VAR1 23.470588235 7.5735294118 16.102941176
VAR2 7.5735294118 6.3161764706 5.3676470588
VAR3 16.102941176 5.3676470588 32.389705882



The following statements generate three bivariate scatter plots of the classical and robust tolerance ellipsoids. The plots are shown in Output 12.5.3, Output 12.5.4, and Output 12.5.5, one plot for each pair of variables.

optn = j(5,1,.); optn[5]= -1;
vnam = {"Rate", "Temperature", "AcidConcent"};
titl = "Stack Loss Data: Use All Subsets";
call MVEScatter(x, optn, 0.9, vnam, titl);

Output 12.5.3: Stack Loss Data: Rate versus Temperature (MVE)

Stack Loss Data: Rate versus Temperature (MVE)


Output 12.5.4: Stack Loss Data: Rate versus Acid Concentration (MVE)

Stack Loss Data: Rate versus Acid Concentration (MVE)


Output 12.5.5: Stack Loss Data: Temperature versus Acid Concentration (MVE)

Stack Loss Data: Temperature versus Acid Concentration (MVE)


You can also use the MCD method for the stack loss data as follows:

optn = j(5,1,.);
optn[1]= 2;              /* print distances */
optn[2]= 1;              /* print covariance matrices */
optn[5]= -1 ;            /* nrep: use all subsets */

call mcd(sc, xmcd, dist, optn, x);

The optimization results are displayed in Output 12.5.6. The reweighted results are displayed in Output 12.5.7.

Output 12.5.6: MCD Results of Optimization

4 5 6 7 8 9 10 11 12 13 14 20

MCD Location Estimate
VAR1 VAR2 VAR3
59.5 20.833333333 87.333333333

MCD Scatter Matrix Estimate
  VAR1 VAR2 VAR3
VAR1 5.1818181818 4.8181818182 4.7272727273
VAR2 4.8181818182 7.6060606061 5.0606060606
VAR3 4.7272727273 5.0606060606 19.151515152

Consistent Scatter Matrix
  VAR1 VAR2 VAR3
VAR1 8.6578437815 8.0502757968 7.8983838007
VAR2 8.0502757968 12.708297013 8.4553211199
VAR3 7.8983838007 8.4553211199 31.998580526



Output 12.5.7: Final Reweighted MCD Results

Reweighted Location Estimate
VAR1 VAR2 VAR3
59.5 20.833333333 87.333333333

Reweighted Scatter Matrix
  VAR1 VAR2 VAR3
VAR1 5.1818181818 4.8181818182 4.7272727273
VAR2 4.8181818182 7.6060606061 5.0606060606
VAR3 4.7272727273 5.0606060606 19.151515152

Eigenvalues
VAR1 VAR2 VAR3
23.191069268 7.3520037086 1.3963209628



The MCD robust distances and outlying diagnostic are displayed in Output 12.5.8. MCD identifies more leverage points than MVE identifies.

Output 12.5.8: MCD Robust Distances

Classical Distances and Robust (Rousseeuw) Distances
Unsquared Mahalanobis Distance and
Unsquared Rousseeuw Distance of Each Observation
N Mahalanobis Distances Robust Distances Weight
1 2.253603 12.173282 0
2 2.324745 12.255677 0
3 1.593712 9.263990 0
4 1.271898 1.401368 1.000000
5 0.303357 1.420020 1.000000
6 0.772895 1.291188 1.000000
7 1.852661 1.460370 1.000000
8 1.852661 1.460370 1.000000
9 1.360622 2.120590 1.000000
10 1.745997 1.809708 1.000000
11 1.465702 1.362278 1.000000
12 1.841504 1.667437 1.000000
13 1.482649 1.416724 1.000000
14 1.778785 1.988240 1.000000
15 1.690241 5.874858 0
16 1.291934 5.606157 0
17 2.700016 6.133319 0
18 1.503155 5.760432 0
19 1.593221 6.156248 0
20 0.807054 2.172300 1.000000
21 2.176761 7.622769 0



The following statements generate three bivariate scatter plots of the classical and robust tolerance ellipsoids:

optn = j(5,1,.); optn[5]= -1;
vnam = {"Rate", "Temperature", "AcidConcent"};
titl = "Stack Loss Data: Use All Subsets";
call MCDScatter(x, optn, 0.9, vnam, titl);

Output 12.5.9, Output 12.5.10, and Output 12.5.11 display these plots, one plot for each pair of variables:

Output 12.5.9: Stack Loss Data: Rate versus Temperature (MCD)

Stack Loss Data: Rate versus Temperature (MCD)


Output 12.5.10: Stack Loss Data: Rate versus Acid Concentration (MCD)

Stack Loss Data: Rate versus Acid Concentration (MCD)


Output 12.5.11: Stack Loss Data: Temperature versus Acid Concentration (MCD)

Stack Loss Data: Temperature versus Acid Concentration (MCD)