Robust Regression Examples |
Example 9.1: LMS and LTS with
Substantial Leverage Points: Hertzsprung-Russell Star Data
The following data are reported in Rousseeuw
and Leroy (1987, p. 27) and are based on
Humphreys (1978) and Vansina and De Greve (1982).
The 47 observations correspond to the 47 stars of
the CYG OB1 cluster in the direction of the constellation Cygnus.
The regressor variable (column 2) is the logarithm
of the effective temperature at the surface of the
star (), and the response variable (column 3)
is the logarithm of its light intensity ().
The results for LS and LMS on page 28 of Rousseeuw
and Leroy (1987) are based on a more precise
(five decimal places) version of the data set.
This data set is remarkable in that it contains four
substantial leverage points (representing giant stars) corresponding
to observations 11, 20, 30, and 34 that greatly affect
the results of and even regression.
ab = { 1 4.37 5.23, 2 4.56 5.74, 3 4.26 4.93,
4 4.56 5.74, 5 4.30 5.19, 6 4.46 5.46,
7 3.84 4.65, 8 4.57 5.27, 9 4.26 5.57,
10 4.37 5.12, 11 3.49 5.73, 12 4.43 5.45,
13 4.48 5.42, 14 4.01 4.05, 15 4.29 4.26,
16 4.42 4.58, 17 4.23 3.94, 18 4.42 4.18,
19 4.23 4.18, 20 3.49 5.89, 21 4.29 4.38,
22 4.29 4.22, 23 4.42 4.42, 24 4.49 4.85,
25 4.38 5.02, 26 4.42 4.66, 27 4.29 4.66,
28 4.38 4.90, 29 4.22 4.39, 30 3.48 6.05,
31 4.38 4.42, 32 4.56 5.10, 33 4.45 5.22,
34 3.49 6.29, 35 4.23 4.34, 36 4.62 5.62,
37 4.53 5.10, 38 4.45 5.22, 39 4.53 5.18,
40 4.43 5.57, 41 4.38 4.62, 42 4.45 5.06,
43 4.50 5.34, 44 4.45 5.34, 45 4.55 5.54,
46 4.45 4.98, 47 4.42 4.50 } ;
a = ab[,2]; b = ab[,3];
The following statements specify that most of the output be printed:
print "*** Hertzsprung-Russell Star Data: Do LMS ***";
optn = j(9,1,.);
optn[2]= 3; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lms(sc,coef,wgt,optn,b,a);
Some simple statistics for the independent and response variables
are shown in Output 9.1.1.
Output 9.1.1: Some Simple Statistics
Median
and Mean |
|
Median |
Mean |
VAR1 |
4.42 |
4.31 |
Intercep |
1 |
1 |
Response |
5.1 |
5.0121276596 |
Dispersion
and Standard Deviation |
|
Dispersion |
StdDev |
VAR1 |
0.163086244 |
0.2908234187 |
Intercep |
0 |
0 |
Response |
0.6671709983 |
0.5712493409 |
Partial output for LS regression is shown in Output 9.1.2.
Output 9.1.2: Table of Unweighted LS Regression
Unweighted
Least-Squares Estimation |
LS
Parameter Estimates |
Variable |
Estimate |
Approx
Std
Err |
t
Value |
Pr
> |t| |
Lower
WCI |
Upper
WCI |
VAR1 |
-0.4133039 |
0.28625748 |
-1.44 |
0.1557 |
-0.9743582 |
0.14775048 |
Intercep |
6.7934673 |
1.23651563 |
5.49 |
<.0001 |
4.3699412 |
9.21699339 |
Sum
of Squares = 14.346394626 |
LS
Scale Estimate = 0.5646315343 |
Cov
Matrix of Parameter Estimates |
|
VAR1 |
Intercep |
VAR1 |
0.0819433428 |
-0.353175807 |
Intercep |
-0.353175807 |
1.5289708954 |
F(1,45)
Statistic = 2.0846120667 |
Probability
= 0.1557164396 |
|
Output 9.1.3 displays the iteration history.
Looking at the column Best Crit in the
iteration history table, you see that, with complete
enumeration, the optimal solution is quickly found.
Output 9.1.3: History of the Iteration Process
Complete
Enumeration for LMS |
Subset |
Singular |
Best
Criterion |
Percent |
271 |
5 |
0.392791 |
25 |
541 |
8 |
0.392791 |
50 |
811 |
27 |
0.392791 |
75 |
1081 |
45 |
0.392791 |
100 |
Minimum
Criterion= 0.3927910898 |
Least
Median of Squares (LMS) Method |
Minimizing
25th Ordered Squared Residual. |
Highest
Possible Breakdown Value = 48.94 % |
Selection
of All 1081 Subsets of 2 Cases Out of 47 |
Among
1081 subsets 45 are singular. |
|
The results of the optimization for LMS estimation
are displayed in Output 9.1.4.
Output 9.1.4: Results of Optimization
Observations
of Best Subset |
2 |
29 |
Estimated
Coefficients |
VAR1 |
Intercep |
3.9705882353 |
-12.62794118 |
LMS
Objective Function = 0.2620588235 |
Preliminary
LMS Scale = 0.3987301586 |
Robust
R Squared = 0.5813148789 |
Final
LMS Scale = 0.3645644492 |
|
Output 9.1.5 displays the results for WLS regression.
Due to the size of the scaled residuals, six
observations (with numbers 7, 9, 11, 20, 30, 34) were
assigned zero weights in the following WLS analysis.
The LTS regression implements the FAST-LTS algorithm, which
improves the algorithm (used in SAS/IML Version 7 and earlier versions,
denoted as V7 LTS in this chapter) in Rousseeuw and Leroy (1987)
by using techniques called
"selective iteration" and "nested extensions."
These techniques are used in the C-steps of the algorithm. See
Rousseeuw and Van Driessen (2000) for details. The FAST-LTS algorithm
significantly improves the speed of computation.
Output 9.1.5: Table of Weighted LS Regression Based on LMS
Weighted
Least-Squares Estimation |
RLS
Parameter Estimates Based on LTS |
Variable |
Estimate |
Approx
Std
Err |
t
Value |
Pr
> |t| |
Lower
WCI |
Upper
WCI |
VAR1 |
3.04615694 |
0.43733923 |
6.97 |
<.0001 |
2.18898779 |
3.90332608 |
Intercep |
-8.5000549 |
1.92630783 |
-4.41 |
<.0001 |
-12.275549 |
-4.7245609 |
Weighted
Sum of Squares = 4.52819451 |
RLS
Scale Estimate = 0.3407455818 |
Cov
Matrix of Parameter Estimates |
|
VAR1 |
Intercep |
VAR1 |
0.1912656038 |
-0.842128459 |
Intercep |
-0.842128459 |
3.7106618752 |
Weighted
R-squared = 0.5543573521 |
F(1,39)
Statistic = 48.514065776 |
Probability
= 2.3923178E-8 |
There
are 41 points with nonzero weight. |
Average
Weight = 0.8723404255 |
|
The following statements implement the LTS regression on the
Hertzsprung-Russell star data:
print "*** Hertzsprung-Russell Star Data: Do LTS ***";
optn = j(9,1,.);
optn[2]= 3; /* ipri */
optn[3]= 3; /* ilsq */
optn[8]= 3; /* icov */
call lts(sc,coef,wgt,optn,b,a);
Output 9.1.6 summarizes the information for the LTS optimization.
Output 9.1.6: Summary of Optimization
LTS: The sum of the 25 smallest squared residuals will be minimized. |
Unweighted Least-Squares Estimation |
Sum of Squares = 14.346394626 |
LS Scale Estimate = 0.5646315343 |
F(1,45) Statistic = 2.0846120667 |
Probability = 0.1557164396 |
Distribution of Residuals |
Least Trimmed Squares (LTS) Method |
Least Trimmed Squares (LTS) Method |
Minimizing Sum of 25 Smallest Squared Residuals. |
Highest Possible Breakdown Value = 48.94 % |
Selection of All 1081 Subsets of 2 Cases Out of 47 |
Among 1081 subsets 45 is/are singular. |
The best half of the entire data set obtained after full iteration consists of the cases: |
2 |
4 |
6 |
10 |
13 |
15 |
17 |
19 |
21 |
22 |
25 |
27 |
28 |
29 |
33 |
35 |
36 |
38 |
39 |
41 |
42 |
43 |
44 |
45 |
46 |
LTS Objective Function = 0.1829636959 |
Preliminary LTS Scale = 0.4524915298 |
Robust R Squared = 0.4039971838 |
Final LTS Scale = 0.3731970408 |
Distribution of Residuals |
Weighted Least-Squares Estimation |
Weighted Sum of Squares = 4.52819451 |
RLS Scale Estimate = 0.3407455818 |
Weighted R-squared = 0.5543573521 |
F(1,39) Statistic = 48.514065776 |
Probability = 2.3923178E-8 |
There are 41 points with nonzero weight. |
Average Weight = 0.8723404255 |
Distribution of Residuals |
The run has been executed successfully. |
|
Output 9.1.7 displays the optimization results
and Output 9.1.8 displays the weighted LS regression
based on LTS estimates.
Output 9.1.7: Results of Optimization
Estimated Coefficients |
VAR1 |
Intercep |
4.219182102 |
-13.6239903 |
|
Output 9.1.8: Table of Weighted LS Regression Based on LTS
LS Parameter Estimates |
Variable |
Estimate |
Approx Std Err |
t Value |
Pr > |t| |
Lower WCI |
Upper WCI |
VAR1 |
-0.4133039 |
0.28625748 |
-1.44 |
0.1557 |
-0.9743582 |
0.14775048 |
Intercep |
6.7934673 |
1.23651563 |
5.49 |
<.0001 |
4.3699412 |
9.21699339 |
Cov Matrix of Parameter Estimates |
|
VAR1 |
Intercep |
VAR1 |
0.0819433428 |
-0.353175807 |
Intercep |
-0.353175807 |
1.5289708954 |
RLS Parameter Estimates Based on LTS |
Variable |
Estimate |
Approx Std Err |
t Value |
Pr > |t| |
Lower WCI |
Upper WCI |
VAR1 |
3.04615694 |
0.43733923 |
6.97 |
<.0001 |
2.18898779 |
3.90332608 |
Intercep |
-8.5000549 |
1.92630783 |
-4.41 |
<.0001 |
-12.275549 |
-4.7245609 |
Cov Matrix of Parameter Estimates |
|
VAR1 |
Intercep |
VAR1 |
0.1912656038 |
-0.842128459 |
Intercep |
-0.842128459 |
3.7106618752 |
|