Robust Regression Examples

Example 9.1: LMS and LTS with Substantial Leverage Points: Hertzsprung-Russell Star Data

The following data are reported in Rousseeuw and Leroy (1987, p. 27) and are based on Humphreys (1978) and Vansina and De Greve (1982). The 47 observations correspond to the 47 stars of the CYG OB1 cluster in the direction of the constellation Cygnus. The regressor variable (column 2) is the logarithm of the effective temperature at the surface of the star (T_e), and the response variable (column 3) is the logarithm of the star's light intensity (L/L_0). The results for LS and LMS on page 28 of Rousseeuw and Leroy (1987) are based on a more precise (five decimal places) version of the data set. This data set is remarkable in that it contains four substantial leverage points (representing giant stars), namely observations 11, 20, 30, and 34, that greatly affect the results of L_2 and even L_1 regression.

  
    ab =  { 1  4.37  5.23,   2  4.56  5.74,   3  4.26  4.93, 
            4  4.56  5.74,   5  4.30  5.19,   6  4.46  5.46, 
            7  3.84  4.65,   8  4.57  5.27,   9  4.26  5.57, 
           10  4.37  5.12,  11  3.49  5.73,  12  4.43  5.45, 
           13  4.48  5.42,  14  4.01  4.05,  15  4.29  4.26, 
           16  4.42  4.58,  17  4.23  3.94,  18  4.42  4.18, 
           19  4.23  4.18,  20  3.49  5.89,  21  4.29  4.38, 
           22  4.29  4.22,  23  4.42  4.42,  24  4.49  4.85, 
           25  4.38  5.02,  26  4.42  4.66,  27  4.29  4.66, 
           28  4.38  4.90,  29  4.22  4.39,  30  3.48  6.05, 
           31  4.38  4.42,  32  4.56  5.10,  33  4.45  5.22, 
           34  3.49  6.29,  35  4.23  4.34,  36  4.62  5.62, 
           37  4.53  5.10,  38  4.45  5.22,  39  4.53  5.18, 
           40  4.43  5.57,  41  4.38  4.62,  42  4.45  5.06, 
           43  4.50  5.34,  44  4.45  5.34,  45  4.55  5.54, 
           46  4.45  4.98,  47  4.42  4.50 } ; 
  
    a = ab[,2]; b = ab[,3];
 
The following statements set the options vector so that most of the output is printed, and then call the LMS subroutine:
  
    print "*** Hertzsprung-Russell Star Data: Do LMS ***"; 
       optn = j(9,1,.); 
       optn[2]= 3;    /* ipri */ 
       optn[3]= 3;    /* ilsq */ 
       optn[8]= 3;    /* icov */ 
    call lms(sc,coef,wgt,optn,b,a);
 

Some simple statistics for the independent and response variables are shown in Output 9.1.1.

Output 9.1.1: Some Simple Statistics

Median and Mean
                   Median            Mean
  VAR1               4.42            4.31
  Intercep              1               1
  Response            5.1    5.0121276596

Dispersion and Standard Deviation
               Dispersion          StdDev
  VAR1      0.163086244    0.2908234187
  Intercep            0               0
  Response  0.6671709983   0.5712493409
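
The location and scale estimates in this table can be cross-checked directly in IML with the MEDIAN, MEAN, MAD, and STD functions; the Dispersion column appears to be the median absolute deviation scaled by 1.4826 (the "NMAD" option of the MAD function), which makes it comparable to the standard deviation under normality. A minimal sketch, using the vectors a and b defined above:

    /* cross-check of Output 9.1.1 (assumes a and b from above)          */
    medx  = median(a);       meanx = mean(a);
    medy  = median(b);       meany = mean(b);
    dispx = mad(a, "NMAD");  sdx   = std(a);   /* 1.4826*MAD vs. std dev */
    dispy = mad(b, "NMAD");  sdy   = std(b);
    print medx meanx dispx sdx, medy meany dispy sdy;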



Partial output for LS regression is shown in Output 9.1.2.

Output 9.1.2: Table of Unweighted LS Regression

Unweighted Least-Squares Estimation

LS Parameter Estimates
  Variable    Estimate  Approx Std Err  t Value  Pr > |t|   Lower WCI   Upper WCI
  VAR1      -0.4133039      0.28625748    -1.44    0.1557  -0.9743582  0.14775048
  Intercep   6.7934673      1.23651563     5.49    <.0001   4.3699412  9.21699339

Sum of Squares = 14.346394626
Degrees of Freedom = 45
LS Scale Estimate = 0.5646315343

Cov Matrix of Parameter Estimates
                    VAR1      Intercep
  VAR1      0.0819433428  -0.353175807
  Intercep  -0.353175807  1.5289708954

R-squared = 0.0442737441
F(1,45) Statistic = 2.0846120667
Probability = 0.1557164396



Output 9.1.3 displays the iteration history. The Best Criterion column of the iteration history table shows that, with complete enumeration, the optimal solution is found quickly.

Output 9.1.3: History of the Iteration Process

Complete Enumeration for LMS
  Subset  Singular  Best Criterion  Percent
     271         5        0.392791       25
     541         8        0.392791       50
     811        27        0.392791       75
    1081        45        0.392791      100

Minimum Criterion = 0.3927910898

Least Median of Squares (LMS) Method
Minimizing 25th Ordered Squared Residual.
Highest Possible Breakdown Value = 48.94 %
Selection of All 1081 Subsets of 2 Cases Out of 47
Among 1081 subsets 45 are singular.
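
The counts in this summary follow from the problem dimensions alone: with n = 47 observations and p = 2 parameters there are comb(47,2) = 1081 elemental subsets, and the reported residual rank and breakdown value match the usual PROGRESS conventions h = floor((n+p+1)/2) and (n-h+1)/n. The following sketch reproduces these figures (the two formulas are stated here as assumptions based on those conventions, not taken from this output):

    /* subset count, residual rank h, and breakdown value for n=47, p=2   */
    n = 47;  p = 2;
    nsub = comb(n, p);              /* number of elemental subsets: 1081  */
    h    = floor((n + p + 1) / 2);  /* rank of minimized residual:    25  */
    bdv  = (n - h + 1) / n;         /* highest possible breakdown: 0.4894 */
    print nsub h bdv;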



The results of the optimization for LMS estimation are displayed in Output 9.1.4.

Output 9.1.4: Results of Optimization

Observations of Best Subset
  2  29

Estimated Coefficients
          VAR1      Intercep
  3.9705882353  -12.62794118

LMS Objective Function = 0.2620588235
Preliminary LMS Scale = 0.3987301586
Robust R Squared = 0.5813148789
Final LMS Scale = 0.3645644492
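
Because LMS minimizes the 25th ordered squared residual, the reported objective function should equal the 25th smallest absolute residual under the estimated coefficients. A minimal check that hard-codes the coefficients from Output 9.1.4 (it only evaluates the criterion; it does not repeat the optimization):

    /* 25th smallest absolute residual under the LMS fit of Output 9.1.4 */
    lmscoef = {3.9705882353 -12.62794118};      /* slope and intercept    */
    absr = abs( b - (lmscoef[1]*a + lmscoef[2]) );
    call sort(absr, 1);                         /* ascending order        */
    print (absr[25])[label="25th ordered |residual|"];  /* about 0.2621   */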

 



Output 9.1.5 displays the results of the WLS regression. Because of the size of their scaled residuals, six observations (numbers 7, 9, 11, 20, 30, and 34) were assigned zero weights in the following WLS analysis.
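
Because the weights are either 0 or 1, the weighted fit reduces to ordinary LS on the 41 retained observations; by definition, these are the RLS estimates shown in Output 9.1.5. A minimal sketch that rebuilds the retained index set from the six observation numbers just listed:

    /* ordinary LS on the 41 observations that keep nonzero weight       */
    excl = {7 9 11 20 30 34};                /* zero-weight observations  */
    keep = setdif(1:nrow(a), excl);          /* the 41 retained indices   */
    X    = a[keep, ] || j(ncol(keep), 1, 1); /* regressor plus intercept  */
    rls  = solve(X`*X, X`*b[keep, ]);        /* normal equations          */
    print rls[rowname={"VAR1" "Intercep"}];  /* about 3.046 and -8.500    */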

The LTS call implements the FAST-LTS algorithm, which improves on the algorithm of Rousseeuw and Leroy (1987) (used in SAS/IML Version 7 and earlier releases and denoted as V7 LTS in this chapter) by applying techniques called "selective iteration" and "nested extensions" in the C-steps of the algorithm. See Rousseeuw and Van Driessen (2000) for details. The FAST-LTS algorithm significantly improves the speed of computation.
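
For orientation, the core of FAST-LTS is the concentration step (C-step): compute the residuals of a trial fit, keep the h observations with the smallest squared residuals, refit LS to them, and iterate; Rousseeuw and Van Driessen (2000) show that this never increases the trimmed criterion. The following CStep module is only a schematic illustration of that idea for the simple slope-and-intercept case (the module name and setup are illustrative, not the FAST-LTS code used by the LTS call):

    /* schematic concentration step: refit LS to the h best-fitting cases */
    start CStep(coef, x, y, h);
       r2   = (y - (coef[1]*x + coef[2]))##2;  /* squared residuals       */
       best = loc(rank(r2) <= h);              /* h smallest residuals    */
       X = x[best, ] || j(h, 1, 1);
       return( solve(X`*X, X`*y[best, ]) );    /* LS fit on that subset   */
    finish CStep;

    /* repeated C-steps from a trial fit (here, roughly the LMS fit above)
       drive the trimmed sum of squares downhill                          */
    newcoef = CStep({3.97 -12.63}, a, b, 25);
    print newcoef;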

Output 9.1.5: Table of Weighted LS Regression Based on LMS

Weighted Least-Squares Estimation

RLS Parameter Estimates Based on LMS
  Variable    Estimate  Approx Std Err  t Value  Pr > |t|   Lower WCI   Upper WCI
  VAR1      3.04615694      0.43733923     6.97    <.0001  2.18898779  3.90332608
  Intercep  -8.5000549      1.92630783    -4.41    <.0001  -12.275549  -4.7245609

Weighted Sum of Squares = 4.52819451
Degrees of Freedom = 39
RLS Scale Estimate = 0.3407455818

Cov Matrix of Parameter Estimates
                    VAR1      Intercep
  VAR1      0.1912656038  -0.842128459
  Intercep  -0.842128459  3.7106618752

Weighted R-squared = 0.5543573521
F(1,39) Statistic = 48.514065776
Probability = 2.3923178E-8

There are 41 points with nonzero weight.
Average Weight = 0.8723404255



The following statements implement the LTS regression on the Hertzsprung-Russell star data:

  
    print "*** Hertzsprung-Russell Star Data: Do LTS ***"; 
       optn = j(9,1,.); 
       optn[2]= 3;    /* ipri */ 
       optn[3]= 3;    /* ilsq */ 
       optn[8]= 3;    /* icov */ 
    call lts(sc,coef,wgt,optn,b,a);
 

Output 9.1.6 summarizes the information for the LTS optimization.

Output 9.1.6: Summary of Optimization
 
LTS: The sum of the 25 smallest squared residuals will be minimized.

Unweighted Least-Squares Estimation

Sum of Squares = 14.346394626

Degrees of Freedom = 45

LS Scale Estimate = 0.5646315343

R-squared = 0.0442737441

F(1,45) Statistic = 2.0846120667

Probability = 0.1557164396


Least Trimmed Squares (LTS) Method

Minimizing Sum of 25 Smallest Squared Residuals.

Highest Possible Breakdown Value = 48.94 %

Selection of All 1081 Subsets of 2 Cases Out of 47

Among 1081 subsets 45 are singular.

The best half of the entire data set obtained after full iteration consists of the cases:

2 4 6 10 13 15 17 19 21 22 25 27 28 29 33 35 36 38 39 41 42 43 44 45 46

LTS Objective Function = 0.1829636959

Preliminary LTS Scale = 0.4524915298

Robust R Squared = 0.4039971838

Final LTS Scale = 0.3731970408


Weighted Least-Squares Estimation

Weighted Sum of Squares = 4.52819451

Degrees of Freedom = 39

RLS Scale Estimate = 0.3407455818

Weighted R-squared = 0.5543573521

F(1,39) Statistic = 48.514065776

Probability = 2.3923178E-8

There are 41 points with nonzero weight.

Average Weight = 0.8723404255


The run has been executed successfully.



Output 9.1.7 displays the optimization results and Output 9.1.8 displays the weighted LS regression based on LTS estimates.

Output 9.1.7: Results of Optimization

Estimated Coefficients
         VAR1     Intercep
  4.219182102  -13.6239903
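
After full C-step iteration, the LTS fit is the LS fit of its own best half, so an ordinary LS fit to the 25 cases listed in Output 9.1.6 should reproduce these coefficients up to rounding. A minimal check, with the case numbers copied from that output:

    /* LS fit to the best half reported in Output 9.1.6                  */
    half = {2 4 6 10 13 15 17 19 21 22 25 27 28 29 33
            35 36 38 39 41 42 43 44 45 46};
    X = a[half, ] || j(ncol(half), 1, 1);
    ltscoef = solve(X`*X, X`*b[half, ]);
    print ltscoef[rowname={"VAR1" "Intercep"}];  /* about 4.219 and -13.62 */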



Output 9.1.8: Table of Weighted LS Regression Based on LTS

LS Parameter Estimates
  Variable    Estimate  Approx Std Err  t Value  Pr > |t|   Lower WCI   Upper WCI
  VAR1      -0.4133039      0.28625748    -1.44    0.1557  -0.9743582  0.14775048
  Intercep   6.7934673      1.23651563     5.49    <.0001   4.3699412  9.21699339

Cov Matrix of Parameter Estimates
                    VAR1      Intercep
  VAR1      0.0819433428  -0.353175807
  Intercep  -0.353175807  1.5289708954

RLS Parameter Estimates Based on LTS
  Variable    Estimate  Approx Std Err  t Value  Pr > |t|   Lower WCI   Upper WCI
  VAR1      3.04615694      0.43733923     6.97    <.0001  2.18898779  3.90332608
  Intercep  -8.5000549      1.92630783    -4.41    <.0001  -12.275549  -4.7245609

Cov Matrix of Parameter Estimates
                    VAR1      Intercep
  VAR1      0.1912656038  -0.842128459
  Intercep  -0.842128459  3.7106618752


