SAS/IML Robust Regression Examples
Example 9.4: Brainlog Data
The following data, consisting of the body weights (in kilograms) and brain weights (in grams) of N=28 animals, are reported by Jerison (1973) and can be found also in Rousseeuw and Leroy
(1987, p. 57). Instead of the original data, this example uses the logarithms of the measurements of the two variables.
title "*** Brainlog Data: Do MVE ***";
aa={ 1.303338E-001 9.084851E-001 ,
2.6674530 2.6263400 ,
1.5602650 2.0773680 ,
1.4418520 2.0606980 ,
1.703332E-002 7.403627E-001 ,
4.0681860 1.6989700 ,
3.4060290 3.6630410 ,
2.2720740 2.6222140 ,
2.7168380 2.8162410 ,
1.0000000 2.0606980 ,
5.185139E-001 1.4082400 ,
2.7234560 2.8325090 ,
2.3159700 2.6085260 ,
1.7923920 3.1205740 ,
3.8230830 3.7567880 ,
3.9731280 1.8450980 ,
8.325089E-001 2.2528530 ,
1.5440680 1.7481880 ,
-9.208187E-001 .0000000 ,
-1.6382720 -3.979400E-001 ,
3.979400E-001 1.0827850 ,
1.7442930 2.2430380 ,
2.0000000 2.1959000 ,
1.7173380 2.6434530 ,
4.9395190 2.1889280 ,
-5.528420E-001 2.787536E-001 ,
-9.136401E-001 4.771213E-001 ,
2.2833010 2.2552720 };
By default, the MVE subroutine (like the MINVOL subroutine) uses only 1500 randomly selected subsets rather than all subsets.
The following specification of the options vector requires that all 3276 subsets of 3 cases out of 28 cases are generated and
evaluated:
title2 "***MVE for BrainLog Data***";
title3 "*** Use All Subsets***";
optn = j(8,1,.);
optn[1]= 3; /* ipri */
optn[2]= 1; /* pcov: print COV */
optn[3]= 1; /* pcor: print CORR */
optn[6]= -1; /* nrep: all subsets */
call mve(sc,xmve,dist,optn,aa);
Specifying optn[1]=3, optn[2]=1, and optn[3]=1 requests that all output be printed. Therefore, the
first part of the output shows the classical scatter and correlation matrix.
Output 9.4.1: Some Simple Statistics
|
Minimum Volume Ellipsoid (MVE) Estimation
|
|
Consider Ellipsoids Containing 15 Cases.
|
|
Classical Covariance Matrix
|
|
|
VAR1
|
VAR2
|
|
VAR1
|
2.6816512357
|
1.3300846932
|
|
VAR2
|
1.3300846932
|
1.0857537552
|
|
Classical Correlation Matrix
|
|
|
VAR1
|
VAR2
|
|
VAR1
|
1
|
0.7794934643
|
|
VAR2
|
0.7794934643
|
1
|
|
Classical Mean
|
|
VAR1
|
1.6378572186
|
|
VAR2
|
1.9219465964
|
|
The second part of the output shows the results of the combinatoric optimization (complete subset sampling).
Output 9.4.2: Iteration History for MVE
|
There are 3276 subsets of 3 cases out of 28 cases.
|
|
The algorithm will draw 1500 random subsets of 3 cases.
|
|
Random Subsampling for MVE
|
|
Subset
|
Singular
|
Best
Criterion
|
Percent
|
|
375
|
0
|
0.453147
|
25
|
|
750
|
0
|
0.453147
|
50
|
|
1125
|
0
|
0.453147
|
75
|
|
1500
|
0
|
0.453147
|
100
|
|
Minimum Criterion= 0.4531471437
|
|
Among 1500 subsets 0 are singular.
|
|
Observations of Best Subset
|
|
1
|
22
|
2
|
Initial MVE Location
Estimates
|
|
VAR1
|
1.5140266
|
|
VAR2
|
1.9259543667
|
|
Initial MVE Scatter Matrix
|
|
|
VAR1
|
VAR2
|
|
VAR1
|
7.5699818309
|
5.2533273517
|
|
VAR2
|
5.2533273517
|
3.7329226108
|
|
The third part of the output shows the optimization results after local improvement.
Output 9.4.3: Table of MVE Results
|
Final MVE Estimates (Using Local Improvement)
|
|
Number of Points with Nonzero Weight=23
|
|
Robust MVE Location Estimates
|
|
VAR1
|
1.3154029661
|
|
VAR2
|
1.8568731174
|
|
Robust MVE Scatter Matrix
|
|
|
VAR1
|
VAR2
|
|
VAR1
|
2.139986054
|
1.6068556533
|
|
VAR2
|
1.6068556533
|
1.2520384784
|
Eigenvalues of Robust
Scatter Matrix
|
|
VAR1
|
3.363074897
|
|
VAR2
|
0.0289496354
|
|
Robust Correlation Matrix
|
|
|
VAR1
|
VAR2
|
|
VAR1
|
1
|
0.9816633012
|
|
VAR2
|
0.9816633012
|
1
|
|
The final output presents a table containing the classical Mahalanobis distances, the robust distances, and the weights
identifying the outlier observations.
Output 9.4.4: Mahalanobis and Robust Distances
|
Classical Distances and Robust (Rousseeuw) Distances
|
|
Unsquared Mahalanobis Distance and
|
|
Unsquared Rousseeuw Distance of Each Observation
|
|
N
|
Mahalanobis Distances
|
Robust Distances
|
Weight
|
|
1
|
1.006591
|
0.855347
|
1.000000
|
|
2
|
0.695261
|
1.477050
|
1.000000
|
|
3
|
0.300831
|
0.239828
|
1.000000
|
|
4
|
0.380817
|
0.517719
|
1.000000
|
|
5
|
1.146485
|
1.108362
|
1.000000
|
|
6
|
2.644176
|
10.599341
|
0
|
|
7
|
1.708334
|
1.808455
|
1.000000
|
|
8
|
0.706522
|
0.690099
|
1.000000
|
|
9
|
0.858404
|
1.052423
|
1.000000
|
|
10
|
0.798698
|
2.077131
|
1.000000
|
|
11
|
0.686485
|
0.888545
|
1.000000
|
|
12
|
0.874349
|
1.035824
|
1.000000
|
|
13
|
0.677791
|
0.683978
|
1.000000
|
|
14
|
1.721526
|
4.257963
|
0
|
|
15
|
1.761947
|
1.716065
|
1.000000
|
|
16
|
2.369473
|
9.584992
|
0
|
|
17
|
1.222253
|
3.571700
|
0
|
|
18
|
0.203178
|
1.323783
|
1.000000
|
|
19
|
1.855201
|
1.741064
|
1.000000
|
|
20
|
2.266268
|
2.026528
|
1.000000
|
|
21
|
0.831416
|
0.743545
|
1.000000
|
|
22
|
0.416158
|
0.419923
|
1.000000
|
|
23
|
0.264182
|
0.944610
|
1.000000
|
|
24
|
1.046120
|
2.289334
|
1.000000
|
|
25
|
2.911101
|
11.471953
|
0
|
|
26
|
1.586458
|
1.518721
|
1.000000
|
|
27
|
1.582124
|
2.054593
|
1.000000
|
|
28
|
0.394664
|
1.675651
|
1.000000
|
|
Distribution of Robust Distances
|
|
MinRes
|
1st Qu.
|
Median
|
Mean
|
3rd Qu.
|
MaxRes
|
|
0.2398280857
|
0.8719458303
|
1.4978856181
|
2.4419474031
|
2.0658619775
|
11.471952618
|
|
Cutoff Value = 2.7162030315
|
The cutoff value is the square root of the 0.975 quantile of the chi square distribution
with 2 degrees of freedom.
|
There are 5 points with large robust distances receiving zero weights. These may include
boundary cases. Only points whose robust distances are subs tantially larger than the
cutoff value should be considered outliers.
|
|
Again, you can call the subroutine scatmve(), which is included in the sample library in file robustmc.sas, to
plot the classical and robust confidence ellipsoids:
optn = j(8,1,.); optn[6]= -1;
vnam = { "Log Body Wgt","Log Brain Wgt" };
filn = "brl";
titl = "BrainLog Data: Use All Subsets";
call scatmve(2,optn,.9,aa,vnam,titl,1,filn);
The output follows.
Output 9.4.5: BrainLog Data: Classical and Robust Ellipsoid
Statistics and Operations Research | SAS/IML Software