SAS/IML Robust Regression Examples
Example 9.1: LMS and LTS with Substantial Leverage Points: Hertzsprung-Russell Star Data
The following data are reported in Rousseeuw and Leroy (1987, p. 27) and are based on Humphrey (1978) and Vansina and De
Greve (1982). The 47 observations correspond to the 47 stars of the CYG OB1 cluster in the direction of Cygnus. The
regressor variable (column 2) x is the logarithm of the effective temperature at the
surface of the star (Te), and the response variable (column 3) y is the logarithm of its light intensity (L /
L0). The results for LS and LMS on page 28 of Rousseeuw and Leroy (1987) are based on a more
precise (five decimal places) version of the data set. This data set is remarkable in that it contains four substantial
leverage points (giant stars) corresponding to observations 11, 20, 30, and 34 that greatly affect the results of L2 and even L1 regression.
ab = { 1 4.37 5.23, 2 4.56 5.74, 3 4.26 4.93,
4 4.56 5.74, 5 4.30 5.19, 6 4.46 5.46,
7 3.84 4.65, 8 4.57 5.27, 9 4.26 5.57,
10 4.37 5.12, 11 3.49 5.73, 12 4.43 5.45,
13 4.48 5.42, 14 4.01 4.05, 15 4.29 4.26,
16 4.42 4.58, 17 4.23 3.94, 18 4.42 4.18,
19 4.23 4.18, 20 3.49 5.89, 21 4.29 4.38,
22 4.29 4.22, 23 4.42 4.42, 24 4.49 4.85,
25 4.38 5.02, 26 4.42 4.66, 27 4.29 4.66,
28 4.38 4.90, 29 4.22 4.39, 30 3.48 6.05,
31 4.38 4.42, 32 4.56 5.10, 33 4.45 5.22,
34 3.49 6.29, 35 4.23 4.34, 36 4.62 5.62,
37 4.53 5.10, 38 4.45 5.22, 39 4.53 5.18,
40 4.43 5.57, 41 4.38 4.62, 42 4.45 5.06,
43 4.50 5.34, 44 4.45 5.34, 45 4.55 5.54,
46 4.45 4.98, 47 4.42 4.50 } ;
a = ab[,2]; b = ab[,3];
The following code specifies that most of the output be printed.
print "*** Hertzsprung-Russell Star Data: Do LMS ***";
optn = j(8,1,.);
optn[2]= 3; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lms(sc,coef,wgt,optn,b,a);
Output 9.1.1: Some Simple Statistics
|
|
|
Median and Mean
|
|
|
Median
|
Mean
|
|
VAR1
|
4.42
|
4.31
|
|
Intercep
|
1
|
1
|
|
Response
|
5.1
|
5.0121276596
|
|
|
|
Dispersion and Standard Deviation
|
|
|
Dispersion
|
StdDev
|
|
VAR1
|
0.163086244
|
0.2908234187
|
|
Intercep
|
0
|
0
|
|
Response
|
0.6671709983
|
0.5712493409
|
|
Partial output for LS regression is shown in Output 9.1.2.
Output 9.1.2: Table of Unweighted LS Regression
|
Unweighted Least-Squares Estimation
|
|
|
|
LS Parameter Estimates
|
|
Variable
|
Estimate
|
Approx
Std Err
|
t Value
|
Pr > |t|
|
Lower WCI
|
Upper WCI
|
|
VAR1
|
-0.4133039
|
0.28625748
|
-1.44
|
0.1557
|
-0.9743582
|
0.14775048
|
|
Intercep
|
6.7934673
|
1.23651563
|
5.49
|
<.0001
|
4.3699412
|
9.21699339
|
|
Sum of Squares = 14.346394626
|
|
LS Scale Estimate = 0.5646315343
|
|
|
|
Cov Matrix of Parameter Estimates
|
|
|
VAR1
|
Intercep
|
|
VAR1
|
0.0819433428
|
-0.353175807
|
|
Intercep
|
-0.353175807
|
1.5289708954
|
|
F(1,45) Statistic = 2.0846120667
|
|
Probability = 0.1557164396
|
|
Looking at the column Best Crit in the iteration history table, you see that, with complete enumeration, the optimal
solution is found very early.
Output 9.1.3: History of Iteration Process
|
Complete Enumeration for LMS
|
|
|
|
Subset
|
Singular
|
Best
Criterion
|
Percent
|
|
271
|
5
|
0.392791
|
25
|
|
541
|
8
|
0.392791
|
50
|
|
811
|
27
|
0.392791
|
75
|
|
1081
|
45
|
0.392791
|
100
|
|
Minimum Criterion= 0.3927910898
|
|
Least Median of Squares (LMS) Method
|
|
Minimizing 25th Ordered Squared Residual.
|
|
Highest Possible Breakdown Value = 48.94 %
|
|
Selection of All 1081 Subsets of 2 Cases Out of 47
|
|
Among 1081 subsets 45 are singular.
|
|
Output 9.1.4: Results of Optimization
|
|
|
Observations of Best Subset
|
|
2
|
29
|
|
|
|
Estimated Coefficients
|
|
VAR1
|
Intercep
|
|
3.9705882353
|
-12.62794118
|
|
LMS Objective Function = 0.2620588235
|
|
Preliminary LMS Scale = 0.3987301586
|
|
Robust R Squared = 0.5813148789
|
|
Final LMS Scale = 0.3645644492
|
|
The output for WLS regression follows. Due to the size of the scaled residuals, six observations (with numbers 7, 9, 11, 20,
30, 34) were assigned zero weights in the following WLS analysis.
Output 9.1.5: Table of Weighted LS Regression
|
Weighted Least-Squares Estimation
|
|
|
|
RLS Parameter Estimates Based on LTS
|
|
Variable
|
Estimate
|
Approx
Std Err
|
t Value
|
Pr > |t|
|
Lower WCI
|
Upper WCI
|
|
VAR1
|
3.04615694
|
0.43733923
|
6.97
|
<.0001
|
2.18898779
|
3.90332608
|
|
Intercep
|
-8.5000549
|
1.92630783
|
-4.41
|
<.0001
|
-12.275549
|
-4.7245609
|
|
Weighted Sum of Squares = 4.52819451
|
|
RLS Scale Estimate = 0.3407455818
|
|
|
|
Cov Matrix of Parameter Estimates
|
|
|
VAR1
|
Intercep
|
|
VAR1
|
0.1912656038
|
-0.842128459
|
|
Intercep
|
-0.842128459
|
3.7106618752
|
|
Weighted R-squared = 0.5543573521
|
|
F(1,39) Statistic = 48.514065776
|
|
Probability = 2.3923178E-8
|
|
There are 41 points with nonzero weight.
|
|
Average Weight = 0.8723404255
|
|
The LTS regression leads to similar results:
print "*** Hertzsprung-Russell Star Data: Do LTS ***";
optn = j(8,1,.);
optn[2]= 3; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lts(sc,coef,wgt,optn,b,a);
Output 9.1.6: History of Iteration Process
|
Complete Enumeration for LTS
|
|
|
|
Subset
|
Singular
|
Best
Criterion
|
Percent
|
|
271
|
5
|
0.274268
|
25
|
|
541
|
8
|
0.274268
|
50
|
|
811
|
27
|
0.274268
|
75
|
|
1081
|
45
|
0.274268
|
100
|
Minimum Criterion= 0.274268243
|
|
Least Trimmed Squares (LTS) Method
|
|
Minimizing Sum of 25 Smallest Squared Residuals.
|
|
Highest Possible Breakdown Value = 48.94 %
|
|
Selection of All 1081 Subsets of 2 Cases Out of 47
|
|
Among 1081 subsets 45 are singular.
|
|
Output 9.1.7: Results of Optimization
|
|
|
Observations of Best Subset
|
|
22
|
36
|
|
|
|
Estimated Coefficients
|
|
VAR1
|
Intercep
|
|
4.2424242424
|
-13.72633939
|
|
LTS Objective Function = 0.1829838175
|
|
Preliminary LTS Scale = 0.4525412929
|
|
Robust R Squared = 0.4038660847
|
|
Final LTS Scale = 0.3743418666
|
|
Statistics and Operations Research | SAS/IML Software