The ROBUSTREG Procedure

LTS Estimation

If the data are contaminated in the X space, M estimation might yield improper results. In that case it is better to use a high breakdown value method. This example shows how you can use LTS estimation to deal with X-space contaminated data. The following data set, hbk, is an artificial data set that was generated by Hawkins, Bradu, and Kass (1984).

data hbk;
   input index $ x1 x2 x3 y @@;
   datalines;
 1  10.1  19.6  28.3   9.7          2   9.5  20.5  28.9  10.1
 3  10.7  20.2  31.0  10.3          4   9.9  21.5  31.7   9.5
 5  10.3  21.1  31.1  10.0          6  10.8  20.4  29.2  10.0
 7  10.5  20.9  29.1  10.8          8   9.9  19.6  28.8  10.3
 9   9.7  20.7  31.0   9.6         10   9.3  19.7  30.3   9.9
11  11.0  24.0  35.0  -0.2         12  12.0  23.0  37.0  -0.4
13  12.0  26.0  34.0   0.7         14  11.0  34.0  34.0   0.1
15   3.4   2.9   2.1  -0.4         16   3.1   2.2   0.3   0.6
17   0.0   1.6   0.2  -0.2         18   2.3   1.6   2.0   0.0
19   0.8   2.9   1.6   0.1         20   3.1   3.4   2.2   0.4
21   2.6   2.2   1.9   0.9         22   0.4   3.2   1.9   0.3
23   2.0   2.3   0.8  -0.8         24   1.3   2.3   0.5   0.7
25   1.0   0.0   0.4  -0.3         26   0.9   3.3   2.5  -0.8
27   3.3   2.5   2.9  -0.7         28   1.8   0.8   2.0   0.3
29   1.2   0.9   0.8   0.3         30   1.2   0.7   3.4  -0.3
31   3.1   1.4   1.0   0.0         32   0.5   2.4   0.3  -0.4
33   1.5   3.1   1.5  -0.6         34   0.4   0.0   0.7  -0.7
35   3.1   2.4   3.0   0.3         36   1.1   2.2   2.7  -1.0
37   0.1   3.0   2.6  -0.6         38   1.5   1.2   0.2   0.9
39   2.1   0.0   1.2  -0.7         40   0.5   2.0   1.2  -0.5
41   3.4   1.6   2.9  -0.1         42   0.3   1.0   2.7  -0.7
43   0.1   3.3   0.9   0.6         44   1.8   0.5   3.2  -0.7
45   1.9   0.1   0.6  -0.5         46   1.8   0.5   3.0  -0.4
47   3.0   0.1   0.8  -0.9         48   3.1   1.6   3.0   0.1
49   3.1   2.5   1.9   0.9         50   2.1   2.8   2.9  -0.4
51   2.3   1.5   0.4   0.7         52   3.3   0.6   1.2  -0.5
53   0.3   0.4   3.3   0.7         54   1.1   3.0   0.3   0.7
55   0.5   2.4   0.9   0.0         56   1.8   3.2   0.9   0.1
57   1.8   0.7   0.7   0.7         58   2.4   3.4   1.5  -0.1
59   1.6   2.1   3.0  -0.3         60   0.3   1.5   3.3  -0.9
61   0.4   3.4   3.0  -0.3         62   0.9   0.1   0.3   0.6
63   1.1   2.7   0.2  -0.3         64   2.8   3.0   2.9  -0.5
65   2.0   0.7   2.7   0.6         66   0.2   1.8   0.8  -0.9
67   1.6   2.0   1.2  -0.7         68   0.1   0.0   1.1   0.6
69   2.0   0.6   0.3   0.2         70   1.0   2.2   2.9   0.7
71   2.2   2.5   2.3   0.2         72   0.6   2.0   1.5  -0.2
73   0.3   1.7   2.2   0.4         74   0.0   2.2   1.6  -0.9
75   0.3   0.4   2.6   0.2
;

Both ordinary least squares (OLS) estimation and M estimation (not shown here) suggest that observations 11 to 14 are outliers. However, these four observations were generated from the underlying model, whereas observations 1 to 10 were contaminated. The reason that OLS estimation and M estimation do not pick up the contaminated observations is that they cannot distinguish good leverage points (observations 11 to 14) from bad leverage points (observations 1 to 10). In such cases, the LTS method identifies the true outliers.
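
The OLS and M fits mentioned above are not shown in this example. The following is a minimal sketch of how they might be produced, assuming the hbk data set from the preceding DATA step is available. The exact statements used for the comparison are not given in this section, so the options here (the R option in PROC REG and the same diagnostics options as the LTS fit) are an assumption.

```sas
/* OLS fit; the R option requests a residual analysis per observation */
proc reg data=hbk;
   model y = x1 x2 x3 / r;
   id index;
run;
quit;

/* M estimation, requesting outlier and leverage-point diagnostics */
proc robustreg data=hbk method=m;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;
```

Comparing the observations flagged by these two fits with the LTS diagnostics later in this example shows where the methods disagree.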

The following statements invoke the ROBUSTREG procedure and use the LTS estimation method:

proc robustreg data=hbk fwls method=lts;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;

Figure 98.12 displays the model-fitting information and summary statistics for the response variable and the independent variables.

Figure 98.12: Model-Fitting Information and Summary Statistics

The ROBUSTREG Procedure

Model Information
Data Set	WORK.HBK
Dependent Variable	y
Number of Independent Variables	3
Number of Observations	75
Method	LTS Estimation

Summary Statistics
Variable	Q1	Median	Q3	Mean	Standard Deviation	MAD
x1	0.8000	1.8000	3.1000	3.2067	3.6526	1.9274
x2	1.0000	2.2000	3.3000	5.5973	8.2391	1.6309
x3	0.9000	2.1000	3.0000	7.2307	11.7403	1.7791
y	-0.5000	0.1000	0.7000	1.2787	3.4928	0.8896

Figure 98.13 displays information about the LTS fit, including the breakdown value of the LTS estimate. The breakdown value measures the proportion of contamination that an estimation method can withstand while remaining robust. In this example, the LTS estimate minimizes the sum of the 57 smallest squared residuals, so it can still estimate the true underlying model if the remaining 18 observations are contaminated. This corresponds to a breakdown value of about 0.25, which is the default.

Figure 98.13: LTS Profile

LTS Profile
Total Number of Observations	75
Number of Squares Minimized	57
Number of Coefficients	4
Highest Possible Breakdown Value	0.2533
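
The numbers in the LTS profile can be checked with a short DATA step. This sketch assumes the default number of squares minimized is h = floor((3n + p + 1)/4), where n is the number of observations and p is the number of coefficients (including the intercept). The expression (n - h + 1)/n for the displayed breakdown value is inferred from the figures above rather than quoted from the procedure's documentation.

```sas
data _null_;
   n = 75;                          /* total number of observations      */
   p = 4;                           /* intercept plus x1, x2, x3         */
   h = floor((3*n + p + 1) / 4);    /* assumed default: floor(230/4)     */
   contaminated = n - h;            /* observations left out of the sum  */
   bdv = (n - h + 1) / n;           /* inferred form of displayed value  */
   put h= contaminated= bdv= 6.4;
run;
```

With n = 75 and p = 4 this gives h = 57, 18 excluded observations, and 19/75, which rounds to 0.2533.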

Figure 98.14 displays parameter estimates for covariates and scale. Two robust estimates of the scale parameter are displayed. For information about computing these estimates, see the section Final Weighted Scale Estimator. The weighted scale estimator (Wscale) is a more efficient estimator of the scale parameter.

Figure 98.14: LTS Parameter Estimates

LTS Parameter Estimates
Parameter	DF	Estimate
Intercept	1	-0.3431
x1	1	0.0901
x2	1	0.0703
x3	1	-0.0731
Scale (sLTS)	0	0.7451
Scale (Wscale)	0	0.5749

Figure 98.15 displays outlier and leverage-point diagnostics. The ID variable index is used to identify the observations. If you do not specify this ID variable, the observation number is used to identify the observations. However, the observation number depends on how the data are read. All of the first 14 observations are flagged as leverage points. The first 10 are also identified as outliers (bad leverage points), whereas observations 11 to 14 have small standardized robust residuals and are therefore good leverage points.

Figure 98.15: Diagnostics

Diagnostics
Obs	index	Mahalanobis Distance	Robust MCD Distance	Leverage	Standardized Robust Residual	Outlier
1	1	1.9168	29.4424	*	17.0868	*
2	2	1.8558	30.2054	*	17.8428	*
3	3	2.3137	31.8909	*	18.3063	*
4	4	2.2297	32.8621	*	16.9702	*
5	5	2.1001	32.2778	*	17.7498	*
6	6	2.1462	30.5892	*	17.5155	*
7	7	2.0105	30.6807	*	18.8801	*
8	8	1.9193	29.7994	*	18.2253	*
9	9	2.2212	31.9537	*	17.1843	*
10	10	2.3335	30.9429	*	17.8021	*
11	11	2.4465	36.6384	*	0.0406
12	12	3.1083	37.9552	*	-0.0874
13	13	2.6624	36.9175	*	1.0776
14	14	6.3816	41.0914	*	-0.7875
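
To work with these flags programmatically, you might save them to a data set and cross-tabulate them. The sketch below assumes the OUTPUT statement keywords OUTLIER=, LEVERAGE=, SR=, and RD= behave as indicator and diagnostic variables. The data set name hbkdiag and the variable names are hypothetical, chosen only for illustration.

```sas
proc robustreg data=hbk method=lts;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
   /* hbkdiag and the variable names below are illustrative */
   output out=hbkdiag outlier=is_outlier leverage=is_leverage
          sr=std_resid rd=robust_dist;
run;

/* Leverage and outlier together: bad leverage point.
   Leverage without outlier: good leverage point.      */
proc freq data=hbkdiag;
   tables is_leverage * is_outlier / norow nocol nopercent;
run;
```

The resulting counts should be consistent with the rows shown in Figure 98.15.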

Figure 98.16 displays the final weighted least squares estimates. These estimates are least squares estimates that are computed after the detected outliers are deleted.

Figure 98.16: Final Weighted LS Estimates

Parameter Estimates for Final Weighted Least Squares Fit
Parameter	DF	Estimate	Standard Error	Lower 95% Confidence Limit	Upper 95% Confidence Limit	Chi-Square	Pr > ChiSq
Intercept	1	-0.1805	0.1044	-0.3852	0.0242	2.99	0.0840
x1	1	0.0814	0.0667	-0.0493	0.2120	1.49	0.2222
x2	1	0.0399	0.0405	-0.0394	0.1192	0.97	0.3242
x3	1	-0.0517	0.0354	-0.1210	0.0177	2.13	0.1441
Scale	0	0.5572
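
Because the final weighted fit is a least squares fit on the observations that remain after the detected outliers are deleted, a rough cross-check is to drop observations 1 to 10 by hand and run ordinary least squares. This is a sketch under the assumption that those 10 observations are the only ones given zero weight. The data set name hbk_clean is hypothetical, and PROC REG reports t tests rather than the chi-square tests shown above, so the output is not expected to match in layout.

```sas
data hbk_clean;
   set hbk;
   /* index was read as a character variable, so convert before comparing */
   if input(index, 8.) > 10;
run;

proc reg data=hbk_clean;
   model y = x1 x2 x3;
run;
quit;
```

If the assumption holds, the coefficient estimates should be close to those in Figure 98.16.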