LTS Estimation

If the data are contaminated in the x-space, M estimation does not perform well, because it is not robust to leverage points. The following example shows how you can use least trimmed squares (LTS) estimation to deal with this situation.

data hbk;
   input index$ x1 x2 x3 y @@;
   datalines;
1   10.1  19.6  28.3   9.7     39  2.1  0.0  1.2  -0.7
2    9.5  20.5  28.9  10.1     40  0.5  2.0  1.2  -0.5
3   10.7  20.2  31.0  10.3     41  3.4  1.6  2.9  -0.1
4    9.9  21.5  31.7   9.5     42  0.3  1.0  2.7  -0.7
5   10.3  21.1  31.1  10.0     43  0.1  3.3  0.9   0.6
6   10.8  20.4  29.2  10.0     44  1.8  0.5  3.2  -0.7
7   10.5  20.9  29.1  10.8     45  1.9  0.1  0.6  -0.5
8    9.9  19.6  28.8  10.3     46  1.8  0.5  3.0  -0.4
9    9.7  20.7  31.0   9.6     47  3.0  0.1  0.8  -0.9
10   9.3  19.7  30.3   9.9     48  3.1  1.6  3.0   0.1
11  11.0  24.0  35.0  -0.2     49  3.1  2.5  1.9   0.9
12  12.0  23.0  37.0  -0.4     50  2.1  2.8  2.9  -0.4
13  12.0  26.0  34.0   0.7     51  2.3  1.5  0.4   0.7
14  11.0  34.0  34.0   0.1     52  3.3  0.6  1.2  -0.5
15   3.4   2.9   2.1  -0.4     53  0.3  0.4  3.3   0.7
16   3.1   2.2   0.3   0.6     54  1.1  3.0  0.3   0.7
17   0.0   1.6   0.2  -0.2     55  0.5  2.4  0.9   0.0
18   2.3   1.6   2.0   0.0     56  1.8  3.2  0.9   0.1
19   0.8   2.9   1.6   0.1     57  1.8  0.7  0.7   0.7
20   3.1   3.4   2.2   0.4     58  2.4  3.4  1.5  -0.1
21   2.6   2.2   1.9   0.9     59  1.6  2.1  3.0  -0.3
22   0.4   3.2   1.9   0.3     60  0.3  1.5  3.3  -0.9
23   2.0   2.3   0.8  -0.8     61  0.4  3.4  3.0  -0.3
24   1.3   2.3   0.5   0.7     62  0.9  0.1  0.3   0.6
25   1.0   0.0   0.4  -0.3     63  1.1  2.7  0.2  -0.3
26   0.9   3.3   2.5  -0.8     64  2.8  3.0  2.9  -0.5
27   3.3   2.5   2.9  -0.7     65  2.0  0.7  2.7   0.6
28   1.8   0.8   2.0   0.3     66  0.2  1.8  0.8  -0.9
29   1.2   0.9   0.8   0.3     67  1.6  2.0  1.2  -0.7
30   1.2   0.7   3.4  -0.3     68  0.1  0.0  1.1   0.6
31   3.1   1.4   1.0   0.0     69  2.0  0.6  0.3   0.2
32   0.5   2.4   0.3  -0.4     70  1.0  2.2  2.9   0.7
33   1.5   3.1   1.5  -0.6     71  2.2  2.5  2.3   0.2
34   0.4   0.0   0.7  -0.7     72  0.6  2.0  1.5  -0.2
35   3.1   2.4   3.0   0.3     73  0.3  1.7  2.2   0.4
36   1.1   2.2   2.7  -1.0     74  0.0  2.2  1.6  -0.9
37   0.1   3.0   2.6  -0.6     75  0.3  0.4  2.6   0.2
38   1.5   1.2   0.2   0.9
;

The data set hbk is an artificial data set generated by Hawkins, Bradu, and Kass (1984). Both ordinary least squares (OLS) estimation and M estimation (not shown here) suggest that observations 11 to 14 are outliers. However, these four observations were generated from the underlying model, whereas observations 1 to 10 were contaminated. The reason that OLS estimation and M estimation do not pick up the contaminated observations is that they cannot distinguish good leverage points (observations 11 to 14) from bad leverage points (observations 1 to 10). In such cases, the LTS method identifies the true outliers.
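
The OLS and M fits mentioned above are not shown in this section. As a rough sketch, assuming the hbk data set created earlier, statements along the following lines would produce them for comparison. The R option in the MODEL statement of PROC REG requests residual diagnostics, and the PROC ROBUSTREG step uses the same MODEL and ID statements as the LTS call that follows, with METHOD=M in place of METHOD=LTS. No output is reproduced here.

```sas
/* OLS fit for comparison; R requests an analysis of the residuals */
proc reg data=hbk;
   model y = x1 x2 x3 / r;
   id index;
run;
quit;

/* M estimation for comparison, with outlier and leverage diagnostics */
proc robustreg data=hbk method=m;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;
```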

The following statements invoke the ROBUSTREG procedure with the LTS estimation method:

proc robustreg data=hbk fwls method=lts;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;

Figure 77.12 displays the model fitting information and summary statistics for the response variable and independent covariates.

Figure 77.12 Model Fitting Information and Summary Statistics
The ROBUSTREG Procedure

Model Information
Data Set WORK.HBK
Dependent Variable y
Number of Independent Variables 3
Number of Observations 75
Method LTS Estimation

Summary Statistics
Variable       Q1   Median       Q3     Mean  Standard Deviation      MAD
x1         0.8000   1.8000   3.1000   3.2067              3.6526   1.9274
x2         1.0000   2.2000   3.3000   5.5973              8.2391   1.6309
x3         0.9000   2.1000   3.0000   7.2307             11.7403   1.7791
y         -0.5000   0.1000   0.7000   1.2787              3.4928   0.8896

Figure 77.13 displays information about the LTS fit, including the breakdown value of the LTS estimate. The breakdown value measures the proportion of contamination that an estimation method can withstand while remaining robust. In this example, the LTS estimate minimizes the sum of the 57 smallest squared residuals, so it can still estimate the true underlying model even if the remaining 18 observations are contaminated. This corresponds to a breakdown value of around 0.25, which is the default.

Figure 77.13 LTS Profile
LTS Profile
Total Number of Observations 75
Number of Squares Minimized 57
Number of Coefficients 4
Highest Possible Breakdown Value 0.2533
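
The values in the LTS Profile table can be checked by hand. The sketch below rests on two assumptions that are not stated in this section: that the default number of squares minimized is h = floor((3n + p + 1)/4), where p counts the intercept, and that the displayed breakdown value is (n - h + 1)/n. The second formula is inferred from the displayed value, since 19/75 is about 0.2533. The second step states the default explicitly through the H= suboption of METHOD=LTS and is intended to request the same fit as before.

```sas
/* Hand check of the LTS Profile values (both formulas are assumptions) */
data _null_;
   n = 75;                          /* total number of observations          */
   p = 4;                           /* coefficients, including the intercept */
   h = floor((3*n + p + 1) / 4);    /* 230/4 = 57.5, so h = 57               */
   bdv = (n - h + 1) / n;           /* 19/75, about 0.2533                   */
   put h= bdv= 6.4;                 /* write both values to the log          */
run;

/* Same LTS fit with the number of squares minimized stated explicitly */
proc robustreg data=hbk fwls method=lts(h=57);
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;
```

A smaller value of H= trims more observations and raises the breakdown value, at the cost of efficiency.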

Figure 77.14 displays parameter estimates for covariates and scale. Two robust estimates of the scale parameter are displayed. See the section Final Weighted Scale Estimator for how these estimates are computed. The weighted scale estimator (Wscale) is a more efficient estimator of the scale parameter.

Figure 77.14 LTS Parameter Estimates
LTS Parameter Estimates
Parameter       DF  Estimate
Intercept        1   -0.3431
x1               1    0.0901
x2               1    0.0703
x3               1   -0.0731
Scale (sLTS)     0    0.7451
Scale (Wscale)   0    0.5749

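Figure 77.15 displays outlier and leverage-point diagnostics. The ID variable index is used to identify the observations. If you do not specify an ID variable, only the observation number identifies the observations, and the observation number depends on how the data are read. Here the INPUT statement uses a trailing @@ to read two observations from each data line, so the observation with index 2, for example, is observation 3 in the data set. The observations with index 1 to 10 are identified as outliers, and the observations with index 11 to 14 are identified as good leverage points: they are flagged as leverage points but not as outliers.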

Figure 77.15 Diagnostics
Diagnostics
Obs  index  Mahalanobis Distance  Robust MCD Distance  Leverage  Standardized Robust Residual  Outlier
  1  1                    1.9168              29.4424  *                              17.0868  *
  3  2                    1.8558              30.2054  *                              17.8428  *
  5  3                    2.3137              31.8909  *                              18.3063  *
  7  4                    2.2297              32.8621  *                              16.9702  *
  9  5                    2.1001              32.2778  *                              17.7498  *
 11  6                    2.1462              30.5892  *                              17.5155  *
 13  7                    2.0105              30.6807  *                              18.8801  *
 15  8                    1.9193              29.7994  *                              18.2253  *
 17  9                    2.2212              31.9537  *                              17.1843  *
 19  10                   2.3335              30.9429  *                              17.8021  *
 21  11                   2.4465              36.6384  *                               0.0406
 23  12                   3.1083              37.9552  *                              -0.0874
 25  13                   2.6624              36.9175  *                               1.0776
 27  14                   6.3816              41.0914  *                              -0.7875
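
To work with these diagnostics as data rather than as printed output, one option is to add an OUTPUT statement to the same PROC ROBUSTREG step. The following is a minimal sketch: the data set name hbk_diag and the variable names is_outlier, is_leverage, and std_resid are hypothetical, and it assumes that the OUTLIER= and LEVERAGE= keywords create 0/1 indicator variables.

```sas
/* Refit with LTS and save the diagnostic indicators to a data set */
proc robustreg data=hbk fwls method=lts;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
   output out=hbk_diag          /* hypothetical output data set name       */
          outlier=is_outlier    /* assumed to be 1 for detected outliers   */
          leverage=is_leverage  /* assumed to be 1 for leverage points     */
          sr=std_resid;         /* standardized robust residual            */
run;

/* List only the flagged observations */
proc print data=hbk_diag;
   where is_outlier = 1 or is_leverage = 1;
   var index is_outlier is_leverage std_resid;
run;
```

If the assumptions hold, the listing should contain the same 14 observations as the Diagnostics table, with is_outlier equal to 1 only for index 1 to 10.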

Figure 77.16 displays the final weighted least squares estimates. These estimates are least squares estimates computed after deleting the detected outliers.

Figure 77.16 Final Weighted LS Estimates
Parameter Estimates for Final Weighted Least Squares Fit
Parameter  DF  Estimate  Standard Error  95% Confidence Limits  Chi-Square  Pr > ChiSq
Intercept   1   -0.1805          0.1044     -0.3852     0.0242        2.99      0.0840
x1          1    0.0814          0.0667     -0.0493     0.2120        1.49      0.2222
x2          1    0.0399          0.0405     -0.0394     0.1192        0.97      0.3242
x3          1   -0.0517          0.0354     -0.1210     0.0177        2.13      0.1441
Scale       0    0.5572
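
Because the final weighted least squares fit is described as a least squares fit with the detected outliers deleted, it can be approximated by dropping those observations and running ordinary least squares. The sketch below assumes that the detected outliers are exactly the observations with index 1 to 10, as reported in Figure 77.15; the data set name hbk_clean is hypothetical. The coefficient estimates should agree closely with Figure 77.16, but PROC REG reports t tests rather than chi-square tests, so the standard errors and p-values need not match exactly.

```sas
/* Drop the observations flagged as outliers (index 1 to 10).        */
/* The variable index was read as character, so convert it first.    */
data hbk_clean;
   set hbk;
   if input(index, 8.) > 10;
run;

/* Ordinary least squares on the remaining 65 observations */
proc reg data=hbk_clean;
   model y = x1 x2 x3;
run;
quit;
```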