Robust Regression Examples


Example 13.4 Relationship between Brain Mass and Body Mass

This section creates graphs that illustrate the results of the MVE and MCD procedures. The following statements load the RobustMC.sas program, which is included in the SAS/IML sample library. The LOAD MODULES=_ALL_ statement loads modules that are defined in the program. In particular, this section uses the MVESCATTER and MCDSCATTER modules.

%include sampsrc(robustmc);   /* define graphing modules */
proc iml;
load module=_all_;            /* load graphing modules */

Jerison (1973) reported data for the body mass (in kilograms) and brain mass (in grams) of $N=28$ animals. These data were further analyzed in Rousseeuw and Leroy (1987). Instead of the original data, the following example uses the logarithms of the measurements of the two variables:

/* Log(Body Mass) Log(Brain Mass) */
mass={ 0.1303338  0.9084851,  2.6674530  2.6263400,
       1.5602650  2.0773680,  1.4418520  2.0606980,
       0.0170333  0.7403627,  4.0681860  1.6989700,
       3.4060290  3.6630410,  2.2720740  2.6222140,
       2.7168380  2.8162410,  1.0000000  2.0606980,
       0.5185139  1.4082400,  2.7234560  2.8325090,
       2.3159700  2.6085260,  1.7923920  3.1205740,
       3.8230830  3.7567880,  3.9731280  1.8450980,
       0.8325089  2.2528530,  1.5440680  1.7481880,
      -0.9208187  0.0000000, -1.6382720 -0.3979400,
       0.3979400  1.0827850,  1.7442930  2.2430380,
       2.0000000  2.1959000,  1.7173380  2.6434530,
       4.9395190  2.1889280, -0.5528420  0.2787536,
      -0.9136401  0.4771213,  2.2833010  2.2552720};

By default, the MVE subroutine uses randomly selected subsets rather than all subsets. The following statements specify that all 3,276 subsets of three observations out of 28 observations be generated and evaluated. Output 13.4.1 shows partial results of the analysis.

optn = j(5,1,.);
optn[1] = 1;             /* print basic output */
optn[2] = 1;             /* print covariance matrices */
optn[5]= -1;             /* nrep: use all subsets */

ods exclude EigenRobust Distances DistrRes;
call mve(sc, xmve, dist, optn, mass);
ods select all;

Output 13.4.1: Results of MVE Robust Estimation


Minimum Volume Ellipsoid (MVE) Estimation


Consider Ellipsoids Containing 15 Cases.

Classical Covariance Matrix
  VAR1 VAR2
VAR1 2.6816512381 1.3300846941
VAR2 1.3300846941 1.0857537552

Classical Mean
VAR1 1.6378572179
VAR2 1.9219465964


There are 3276 subsets of 3 cases out of 28 cases.


All 3276 subsets will be considered.


Complete Enumeration for MVE


25 % of calculations have been executed.


75 % of calculations have been executed.


Minimum Criterion= 0.439709106


Among 3276 subsets 0 are singular.

Initial MVE Location
Estimates
VAR1 1.3859759333
VAR2 1.8022650333

Initial MVE Scatter Matrix
  VAR1 VAR2
VAR1 4.9018525125 3.2937139101
VAR2 3.2937139101 2.3400650932


Final MVE Estimates (Using Local Improvement)


Number of Points with Nonzero Weight=24

Robust MVE Location Estimates
VAR1 1.2952823792
VAR2 1.8733722792

Robust MVE Scatter Matrix
  VAR1 VAR2
VAR1 2.056659296 1.5290250177
VAR2 1.5290250177 1.2041353589


Distribution of Robust Distances


Cutoff Value = 2.7162030315


The cutoff value is the square root of the 0.975 quantile of the chi square distribution with 2 degrees of freedom.


There are 5 points with large robust distances receiving zero weights. These may include boundary cases. Only points whose robust distances are substantially larger than the cutoff value should be considered outliers.



The MVE routine also returns information in the sc, xmve, and dist matrices. The following statements print some of that information:

r1 = {"Quantile", "Number of Subsets", "Number of Singular Subsets",
      "Number of Nonzero Weights", "Min Objective Function",
      "Min Distance", "Chi-Square Cutoff Value"};
RobustCenter = xmve[1,];
RobustCov =    xmve[3:4,];
print sc[r=r1],
      RobustCenter[c={"X1","X2"}],
      RobustCov[r={"X1","X2"} c={"X1","X2"}];
MVEOutliers = loc(dist[3,]=0);
print MVEOutliers;

Output 13.4.2: Robust Estimates and Outliers

sc
Quantile 15
Number of Subsets 3276
Number of Singular Subsets 0
Number of Nonzero Weights 23
Min Objective Function 0.4397091
Min Distance 1.4755584
Chi-Square Cutoff Value 2.716203

RobustCenter
X1 X2
1.2952824 1.8733723

RobustCov
  X1 X2
X1 2.0566593 1.529025
X2 1.529025 1.2041354

MVEOutliers
6 14 16 17 25



The first table in Output 13.4.2 shows some summary statistics about the combinatoric optimization (complete subset sampling) during the MVE algorithm. The algorithm considers subsets of three observations and strives to find the subset that minimizes the volume of an ellisoid that contains 15 points. The MVE algorithm considers 3,276 subsets, of which zero are singular. The minimum volume is found to be 1.47.

The mean vector is contained in the RobustCenter vector. The covariance matrix is contained in the RobustCov matrix.

Based on that center and covariance matrix, a Mahalanobis-type distance (called the robust distance) that measures each observation’s distance from the robust center is computed. Observations that are more than a certain distance (the "Cutoff Value") from the center are classified as outliers. For these data, observations 6, 14, 16, 17, and 25 are classified as outliers.

You can call the MVESCATTER subroutine, which is included in the sample library in the file RobustMC.sas, to plot the classical and robust confidence ellipsoids. The LOAD MODULES=_ALL_ statement loads modules that are defined in the program. The following statements create the scatter plot that is shown in Output 13.4.3.

optn = j(5,1,.); optn[5]= -1;
vnam = {"Log(Body Mass)","Log(Brain Mass)"};
titl = "Estimates of Location and Scale (MVE)";
call MVEScatter(mass, optn, 0.9, vnam, titl);

Output 13.4.3: BrainLog Data: Classical and Robust Ellipses (MVE)

BrainLog Data: Classical and Robust Ellipses (MVE)


MCD is another subroutine that can be used to compute the robust location and the robust covariance of multivariate data sets. The following statements call the MCD subroutine and produce Output 13.4.4

/* MCD: Use Random Subsets */
optn = j(5,1,.);
call mcd(sc, xmve, dist, optn, mass);

r1 = {"Quantile", "Number of Subsets", "Number of Singular Subsets",
      "Number of Nonzero Weights", "Min Objective Function",
      "Min Distance", "Chi-Square Cutoff Value"};
RobustCenter = xmve[1,];
RobustCov =    xmve[3:4,];
print sc[r=r1],
      RobustCenter[c={"X1","X2"}],
      RobustCov[r={"X1","X2"} c={"X1","X2"}];

Output 13.4.4: Results of MCD Robust Estimation

sc
Quantile 15
Number of Subsets 500
Number of Singular Subsets 0
Number of Nonzero Weights 23
Min Objective Function 0.0174302
Min Distance 1.9190823
Chi-Square Cutoff Value 2.716203

RobustCenter
X1 X2
1.315403 1.8568731

RobustCov
  X1 X2
X1 2.1399861 1.6068557
X2 1.6068557 1.2520385



The results are similar to Output 13.4.1. For the MCD subroutine, 500 random subsets are used for the optimization. The robust center and covariance matrices are slightly different from those found by the MVE subroutine. However, the observations that are classified as outliers (not shown) are the same.

You can call the MCDSCATTER subroutine, which is included in the SAS/IML sample library, to plot the classical and robust confidence ellipsoids, as follows:

optn = j(5,1,.); optn[5]= -1;
vnam = { "Log(Body Mass)","Log(Brain Mass)" };
titl = "Estimates of Location and Scale (MCD)";
call MCDScatter(mass, optn, 0.9, vnam, titl);

The plot is shown in Output 13.4.5. It looks very similar to Output 13.4.3.

Output 13.4.5: BrainLog Data: Classical and Robust Ellipsoid (MCD)

BrainLog Data: Classical and Robust Ellipsoid (MCD)