LTS Call
performs robust regression
- CALL LTS( sc, coef, wgt, opt, y<, <x><, sorb>>);
A new algorithm, FAST-LTS, was added to the LTS subroutine in SAS/IML
Release 8.1 and is now the default algorithm. The original algorithm
is kept for convenience and can be used temporarily by specifying
optn[9]=1. Eventually the new FAST-LTS algorithm will replace the
original algorithm.
The original algorithm for the LTS subroutine and the algorithm used
in the LMS subroutine are
based on the PROGRESS program by Rousseeuw and Leroy (1987).
Rousseeuw and Hubert (1996) prepared a new version of
PROGRESS to facilitate its inclusion in SAS software,
and they have incorporated several recent developments.
Among other things, the new version of PROGRESS now yields the
exact LMS for simple regression, and the program uses a new
definition of the robust coefficient of determination (R²).
Therefore, the outputs may differ slightly from those
given in Rousseeuw and Leroy (1987) or those obtained
from software based on the older version of PROGRESS.
FAST-LTS
Least trimmed squares (LTS) regression is based on the subset of h cases
(out of n) whose least squares fit possesses the smallest sum of squared
residuals. The coverage h may be set between n/2 and n. The
LTS method was proposed by Rousseeuw (1984, p. 876) as a highly robust
regression estimator, with breakdown value (n-h)/n. It turned
out that the computation time of the previous LTS algorithm grew too fast
with the size of the data set, precluding its use for data mining.
Rousseeuw and Van Driessen (1998) developed a new algorithm called
FAST-LTS. The basic idea is an inequality involving order statistics
and sums of squared residuals. Based on this inequality, techniques
called "selective iteration"
and "nested extensions" are developed
in Rousseeuw and Van Driessen (1998). The new LTS algorithm implements
these techniques to achieve faster computation. The intercept adjustment
technique is also used in this new algorithm. For small data sets,
FAST-LTS typically finds the exact LTS, whereas for larger data sets it
gives more accurate results than the previous LTS algorithm and is faster
by orders of magnitude. The new algorithm is described briefly as follows;
refer to Rousseeuw and Van Driessen (1998) for details:
- The default h is (n+p+1)/2, where p is the number of
independent variables. You can choose any integer h between n/2
and n. The LTS breakdown value (n-h+1)/n is reported. If you are
sure that the data contain less than 25% contamination, you can
obtain a good compromise between breakdown value and statistical
efficiency by setting h=[.75n].
- If p=1 (univariate data), then compute the LTS estimator by the exact
algorithm of Rousseeuw and Leroy (1987, pp. 171-172) and stop.
- From here on, p ≥ 2. If n ≤ 600, draw a random p-subset and compute
the regression coefficients using these p points (if the regression
is degenerate, draw another p-subset). Compute the absolute residuals
for all points in the data set and select the h points with the smallest
absolute residuals. From this selected h-subset, carry out C-steps
(concentration steps; refer to Rousseeuw and Van Driessen [1998] for
details) until convergence. Repeat this procedure 500 times, or as many
times as opt[5] of the LTS call specifies, and find the ten (at most)
solutions with the lowest sums of h squared residuals. For each of these
ten best solutions, take C-steps until convergence and find the best
final solution.
- If n > 600, construct up to five disjoint random subsets with sizes
as equal as possible, but not exceeding 300. Inside each subset, repeat
the procedure in step 3 500/5 = 100 times and keep the ten best
solutions. Pool the subsets, yielding the merged set of size n_merged.
In the merged set, for each of the 5×10 = 50 best solutions,
carry out two C-steps using
n_merged and h_merged = [n_merged (h/n)]
and keep the ten best solutions. In the full data set, for
each of these ten best solutions, take C-steps using n and h until
convergence and find the best final solution.
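The random-start and C-step procedure above can be illustrated with a minimal Python sketch for simple regression with an intercept. This is not the SAS/IML implementation. The helper names (fit_ls, h_smallest, objective, lts_from_start) are hypothetical, the data are the nine observations from the illustrative example in this section, p = 2 counts the intercept (an assumption that reproduces h = 6 in the example output), and the objective is taken to be the square root of the mean of the h smallest squared residuals, which is an assumption that appears consistent with the "LTS Objective Function" value in the FAST-LTS output.

```python
import itertools
import math

# Data from the illustrative example in this section.
x = [42, 37, 37, 28, 18, 18, 19, 20, 15]
y = [80, 80, 75, 62, 62, 62, 62, 62, 58]

n = len(x)
p = 2                        # assumption: slope plus intercept
h = (n + p + 1) // 2         # default coverage
breakdown = (n - h + 1) / n  # reported breakdown value


def fit_ls(idx):
    """Ordinary least-squares line through the observations in idx."""
    m = len(idx)
    mx = sum(x[i] for i in idx) / m
    my = sum(y[i] for i in idx) / m
    sxx = sum((x[i] - mx) ** 2 for i in idx)
    if sxx == 0:
        raise ValueError("degenerate subset: all x values are equal")
    sxy = sum((x[i] - mx) * (y[i] - my) for i in idx)
    slope = sxy / sxx
    return slope, my - slope * mx


def h_smallest(slope, intercept):
    """Indices of the h observations with the smallest absolute residuals."""
    order = sorted(range(n), key=lambda i: abs(y[i] - intercept - slope * x[i]))
    return sorted(order[:h])


def objective(slope, intercept):
    """Square root of the mean of the h smallest squared residuals."""
    sq = sorted((y[i] - intercept - slope * x[i]) ** 2 for i in range(n))
    return math.sqrt(sum(sq[:h]) / h)


def lts_from_start(start, max_steps=100):
    """Fit a p-subset, then repeat C-steps until the h-subset stops changing."""
    slope, intercept = fit_ls(start)
    subset = h_smallest(slope, intercept)
    for _ in range(max_steps):
        slope, intercept = fit_ls(subset)          # refit on the h-subset
        new_subset = h_smallest(slope, intercept)  # re-select the h best cases
        if new_subset == subset:
            break
        subset = new_subset
    return slope, intercept, subset, objective(slope, intercept)


# One start: observations 1 and 5 (zero-based indices 0 and 4).
slope, intercept, subset, obj = lts_from_start([0, 4])

# With only nine cases, every p-subset can be tried instead of random draws.
best = None
degenerate = 0
for pair in itertools.combinations(range(n), p):
    try:
        candidate = lts_from_start(list(pair))
    except ValueError:
        degenerate += 1      # singular start, skip it
        continue
    if best is None or candidate[3] < best[3]:
        best = candidate
```

Starting from observations 1 and 5, this sketch converges after one C-step to the subset of observations 1, 3, 5, 6, 7, and 8, and the refitted coefficients agree with the estimated coefficients in the FAST-LTS output (about 0.74887 and 47.9457). Two of the 36 starting pairs are degenerate because their x values coincide.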
OPTION Changes
The FAST-LTS algorithm is the default (optn[9] = 0); the previous LTS
algorithm is used if optn[9] = 1.
OUTPUT Changes
Because of the change in algorithm, the output from the FAST-LTS
algorithm is different:
- The "Complete Enumeration for LTS" table and the
"Resistant Diagnostic" table do not apply for the FAST-LTS
algorithm and are not displayed.
- The analysis based on "Observations of Best Subset" of size p,
where p is the number of independent variables, is replaced
by the analysis based on observations of the best h of the entire
data set obtained after full iteration, which gives the exact LTS
estimator and covariances for small data sets. LTS Objective Function,
Preliminary LTS Scale, Robust R Squared, and Final LTS Scale are also
reported for the LTS estimator.
- The LTS residuals change because the LTS estimator changes.
- The "Coef" vector does not include the best subset.
See the following illustrative example for details.
Illustrative Example
The following example shows the difference between
the two algorithms. The second output is generated by the FAST-LTS
algorithm.
title1 'Compare two algorithms for LTS';
proc iml;
reset noname;
x = {42, 37, 37, 28, 18, 18, 19, 20, 15};
y = {80, 80, 75, 62, 62, 62, 62, 62, 58};
optn = j(9, 1, .);
optn[1]= 0; /* --- with intercept --- */
optn[2]= 4; /* --- print all output --- */
optn[3]= 3; /* --- compute LS and WLS --- */
optn[8]= 3; /* --- covariance matrices --- */
optn[9]= 1; /* --- Version 7 LTS --- */
call lts(sc, coef, wgt, optn, y, x);
print "sc is " sc, "coef is " coef;
optn[9]= 0; /* --- FAST-LTS algorithm --- */
call lts(sc, coef, wgt, optn, y, x);
print "sc is " sc, "coef is " coef;
quit;
Comparison of the outputs is summarized as follows:
- The summary statistics for the dependent and independent variables
and the results of the classical (unweighted) least-squares estimation
(page 1 and the first half of page 2 of the output) do not change.
- The "Complete Enumeration for LTS" table on page 1 and the "Resistant
Diagnostic" table on page 3 of the first output are eliminated.
The analysis based on "Observations of Best Subset" of size p,
where p is the number of independent variables, is replaced
by the analysis based on observations of the best h of the entire
data set obtained after full iteration.
- The best ten (at most) estimates before the final search are available
by setting a proper option (see pages 2 and 3 of the second output). The
first one is the best with the smallest objective value.
- The results of the weighted least-squares estimation (page 4 of the first
output, page 5 of the second output) do not change because the two algorithms
detect the same outliers in this simple example. The results differ
if the two algorithms detect different outliers.
- The "Coef" vector does not include the best subset (page 6 of the second
output).
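To illustrate why the weighted least-squares results coincide, the following Python sketch (not SAS/IML; the function name weighted_ls is hypothetical) refits the example data with the 0/1 weights shown in the "Weighted LS Residuals" tables, where observations 2 and 4 receive weight 0. Both algorithms assign these same weights here, so the refit is identical for both. The scale formula, with the weighted sum of squares divided by the number of nonzero weights minus 2, is an assumption that appears consistent with the F(1,5) degrees of freedom in the output.

```python
import math

# Data from the illustrative example in this section.
x = [42, 37, 37, 28, 18, 18, 19, 20, 15]
y = [80, 80, 75, 62, 62, 62, 62, 62, 58]

# Weights copied from the "Weighted LS Residuals" table of either output.
w = [1, 0, 1, 0, 1, 1, 1, 1, 1]


def weighted_ls(x, y, w):
    """Weighted least-squares line; returns slope, intercept, weighted SS."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    wss = sum(wi * (yi - intercept - slope * xi) ** 2
              for wi, xi, yi in zip(w, x, y))
    return slope, intercept, wss


slope, intercept, wss = weighted_ls(x, y, w)

# Assumed scale estimate: sqrt(weighted SS / (nonzero weights - 2)).
scale = math.sqrt(wss / (sum(w) - 2))

print(slope, intercept, wss, scale)
```

The values computed this way agree, to the printed precision, with the RLS parameter estimates, the weighted sum of squares, and the RLS scale estimate reported in both outputs.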
Output from Previous Algorithm for LTS
| Compare two algorithms for LTS |
| LTS: The sum of the 6 smallest squared residuals will be minimized. |
| Median and Mean |
| | Median | Mean |
| VAR1 | 20 | 26 |
| Intercep | 1 | 1 |
| Response | 62 | 67 |
| Dispersion and Standard Deviation |
| | Dispersion | StdDev |
| VAR1 | 7.4130110925 | 10.22252415 |
| Intercep | 0 | 0 |
| Response | 7.3806276975 | 8.7177978871 |
| Unweighted Least-Squares Estimation |
| LS Parameter Estimates |
| Variable | Estimate | Approx Std Err | t Value | Pr > |t| | Lower WCI | Upper WCI |
| VAR1 | 0.80502392 | 0.10637482 | 7.57 | 0.0001 | 0.59653312 | 1.01351473 |
| Intercep | 46.069378 | 2.94965086 | 15.62 | <.0001 | 40.2881685 | 51.8505874 |
| Sum of Squares = 66.218899522 |
| LS Scale Estimate = 3.0756857429 |
| Cov Matrix of Parameter Estimates |
| | VAR1 | Intercep |
| VAR1 | 0.0113156014 | -0.294205637 |
| Intercep | -0.294205637 | 8.7004402045 |
| F(1,7) Statistic = 57.271681208 |
| Probability = 0.0001297174 |
| LS Residuals |
| N | Observed | Estimated | Residual | Res / S |
| 1 | 80.000000 | 79.880383 | 0.119617 | 0.038891 |
| 2 | 80.000000 | 75.855263 | 4.144737 | 1.347581 |
| 3 | 75.000000 | 75.855263 | -0.855263 | -0.278072 |
| 4 | 62.000000 | 68.610048 | -6.610048 | -2.149130 |
| 5 | 62.000000 | 60.559809 | 1.440191 | 0.468251 |
| 6 | 62.000000 | 60.559809 | 1.440191 | 0.468251 |
| 7 | 62.000000 | 61.364833 | 0.635167 | 0.206512 |
| 8 | 62.000000 | 62.169856 | -0.169856 | -0.055226 |
| 9 | 58.000000 | 58.144737 | -0.144737 | -0.047058 |
| Distribution of Residuals |
| MinRes | 1st Qu. | Median | Mean | 3rd Qu. | MaxRes |
| -6.610047847 | -0.512559809 | 0.1196172249 | -2.36848E-15 | 1.0376794258 | 4.1447368421 |
| There are 36 subsets of 2 cases out of 9 cases. |
| All 36 subsets will be considered. |
| Complete Enumeration for LTS |
| Subset | Singular | Best Criterion | Percent |
| 10 | 1 | 0.084493 | 27 |
| 18 | 1 | 0.084493 | 50 |
| 28 | 2 | 0.084493 | 77 |
| 36 | 2 | 0.084493 | 100 |
| Minimum Criterion= 0.0844927545 |
| Least Trimmed Squares (LTS) Method |
| Minimizing Sum of 6 Smallest Squared Residuals. |
| Highest Possible Breakdown Value = 44.44 % |
| Selection of All 36 Subsets of 2 Cases Out of 9 |
| Among 36 subsets 2 is/are singular. |
| Observations of Best Subset |
| 1 | 5 |
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.75 | 47.916666667 |
| LTS Objective Function = 0.6236095645 |
| Preliminary LTS Scale = 1.189465671 |
| Final LTS Scale = 0.8595864639 |
| LTS Residuals |
| N | Observed | Estimated | Residual | Res / S |
| 1 | 80.000000 | 79.416667 | 0.583333 | 0.678621 |
| 2 | 80.000000 | 75.666667 | 4.333333 | 5.041184 |
| 3 | 75.000000 | 75.666667 | -0.666667 | -0.775567 |
| 4 | 62.000000 | 68.916667 | -6.916667 | -8.046505 |
| 5 | 62.000000 | 61.416667 | 0.583333 | 0.678621 |
| 6 | 62.000000 | 61.416667 | 0.583333 | 0.678621 |
| 7 | 62.000000 | 62.166667 | -0.166667 | -0.193892 |
| 8 | 62.000000 | 62.916667 | -0.916667 | -1.066404 |
| 9 | 58.000000 | 59.166667 | -1.166667 | -1.357242 |
| Distribution of Residuals |
| MinRes | 1st Qu. | Median | Mean | 3rd Qu. | MaxRes |
| -6.916666667 | -1.041666667 | -0.166666667 | -0.416666667 | 0.5833333333 | 4.3333333333 |
| Resistant Diagnostic |
| N | U | Resistant Diagnostic |
| 1 | 12.521981 | 5.600000 |
| 2 | 12.521981 | 5.600000 |
| 3 | 9.167879 | 4.100000 |
| 4 | 13.339459 | 5.965588 |
| 5 | 1.709204 | 0.764379 |
| 6 | 1.709204 | 0.764379 |
| 7 | 0.642081 | 0.287147 |
| 8 | 1.697749 | 0.759257 |
| 9 | 2.236068 | 1.000000 |
| Weighted Least-Squares Estimation |
| RLS Parameter Estimates Based on LTS |
| Variable | Estimate | Approx Std Err | t Value | Pr > |t| | Lower WCI | Upper WCI |
| VAR1 | 0.76455907 | 0.03125286 | 24.46 | <.0001 | 0.70330458 | 0.82581356 |
| Intercep | 47.3985025 | 0.815574 | 58.12 | <.0001 | 45.8000068 | 48.9969982 |
| Weighted Sum of Squares = 3.3544093178 |
| RLS Scale Estimate = 0.819073784 |
| Cov Matrix of Parameter Estimates |
| | VAR1 | Intercep |
| VAR1 | 0.0009767415 | -0.02358133 |
| Intercep | -0.02358133 | 0.6651609492 |
| Weighted R-squared = 0.9917145853 |
| F(1,5) Statistic = 598.47009637 |
| Probability = 2.1279132E-6 |
| There are 7 points with nonzero weight. |
| Average Weight = 0.7777777778 |
| Weighted LS Residuals |
| N | Observed | Estimated | Residual | Res / S | Weight |
| 1 | 80.000000 | 79.509983 | 0.490017 | 0.598257 | 1.000000 |
| 2 | 80.000000 | 75.687188 | 4.312812 | 5.265474 | 0 |
| 3 | 75.000000 | 75.687188 | -0.687188 | -0.838982 | 1.000000 |
| 4 | 62.000000 | 68.806156 | -6.806156 | -8.309577 | 0 |
| 5 | 62.000000 | 61.160566 | 0.839434 | 1.024858 | 1.000000 |
| 6 | 62.000000 | 61.160566 | 0.839434 | 1.024858 | 1.000000 |
| 7 | 62.000000 | 61.925125 | 0.074875 | 0.091414 | 1.000000 |
| 8 | 62.000000 | 62.689684 | -0.689684 | -0.842029 | 1.000000 |
| 9 | 58.000000 | 58.866889 | -0.866889 | -1.058377 | 1.000000 |
| Distribution of Residuals |
| MinRes | 1st Qu. | Median | Mean | 3rd Qu. | MaxRes |
| -6.806156406 | -0.77828619 | 0.074875208 | -0.27703827 | 0.6647254576 | 4.31281198 |
| The run has been executed successfully. |
| sc is | 6 |
| | 36 |
| | 2 |
| | 7 |
| | 0.6236096 |
| | 1.1894657 |
| | 0.8595865 |
| | 0.825 |
| | 1.9073884 |
| | . |
| | 0.8190738 |
| | 3.3544093 |
| | 0.9917146 |
| | 598.4701 |
| | . |
| | . |
| | . |
| | . |
| | . |
| | . |
| coef is | 0.75 | 47.916667 |
| | 1 | 5 |
| | 0.7645591 | 47.398502 |
| | 0.0312529 | 0.815574 |
| | 24.463648 | 58.11674 |
| | 2.1279E-6 | 2.8538E-8 |
| | 0.7033046 | 45.800007 |
| | 0.8258136 | 48.996998 |
Output from FAST-LTS Algorithm
| Compare two algorithms for LTS |
| LTS: The sum of the 6 smallest squared residuals will be minimized. |
| Median and Mean |
| | Median | Mean |
| VAR1 | 20 | 26 |
| Intercep | 1 | 1 |
| Response | 62 | 67 |
| Dispersion and Standard Deviation |
| | Dispersion | StdDev |
| VAR1 | 7.4130110925 | 10.22252415 |
| Intercep | 0 | 0 |
| Response | 7.3806276975 | 8.7177978871 |
| Unweighted Least-Squares Estimation |
| LS Parameter Estimates |
| Variable | Estimate | Approx Std Err | t Value | Pr > |t| | Lower WCI | Upper WCI |
| VAR1 | 0.80502392 | 0.10637482 | 7.57 | 0.0001 | 0.59653312 | 1.01351473 |
| Intercep | 46.069378 | 2.94965086 | 15.62 | <.0001 | 40.2881685 | 51.8505874 |
| Sum of Squares = 66.218899522 |
| LS Scale Estimate = 3.0756857429 |
| Cov Matrix of Parameter Estimates |
| | VAR1 | Intercep |
| VAR1 | 0.0113156014 | -0.294205637 |
| Intercep | -0.294205637 | 8.7004402045 |
| F(1,7) Statistic = 57.271681208 |
| Probability = 0.0001297174 |
| LS Residuals |
| N | Observed | Estimated | Residual | Res / S |
| 1 | 80.000000 | 79.880383 | 0.119617 | 0.038891 |
| 2 | 80.000000 | 75.855263 | 4.144737 | 1.347581 |
| 3 | 75.000000 | 75.855263 | -0.855263 | -0.278072 |
| 4 | 62.000000 | 68.610048 | -6.610048 | -2.149130 |
| 5 | 62.000000 | 60.559809 | 1.440191 | 0.468251 |
| 6 | 62.000000 | 60.559809 | 1.440191 | 0.468251 |
| 7 | 62.000000 | 61.364833 | 0.635167 | 0.206512 |
| 8 | 62.000000 | 62.169856 | -0.169856 | -0.055226 |
| 9 | 58.000000 | 58.144737 | -0.144737 | -0.047058 |
| Distribution of Residuals |
| MinRes | 1st Qu. | Median | Mean | 3rd Qu. | MaxRes |
| -6.610047847 | -0.512559809 | 0.1196172249 | -2.36848E-15 | 1.0376794258 | 4.1447368421 |
| Least Trimmed Squares (LTS) Method |
| The (at most) 10 Best Estimates |
| Objective Value [1]: 0.0428203092 |
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.7521545304 | 0.1250675364 |
| Objective Value [2]: 0.0454534796 |
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.7773132092 | 0.0679381212 |
| Objective Value [3]: 0.0458503276 |
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.7857569344 | 0.0875053435 |
| Objective Value [4]: 0.0470161862 |
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.7454450208 | 0.1033087106 |
| Objective Value [5]: 0.0504563846 |
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.987904817 | 0.1606655488 |
| Objective Value [6]: 0.1790484378 |
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.1926222834 | -0.081665104 |
| Objective Value [7]: 1.797693E308 |
| Least Trimmed Squares (LTS) Method |
| Minimizing Sum of 6 Smallest Squared Residuals. |
| Highest Possible Breakdown Value = 44.44 % |
| Selection of All 36 Subsets of 2 Cases Out of 9 |
| Among 36 subsets 2 is/are singular. |
| The best half of the entire data set obtained after full iteration consists of the cases: |
|
| Estimated Coefficients |
| VAR1 | Intercep |
| 0.7488687783 | 47.945701357 |
| LTS Objective Function = 0.6235087791 |
| Preliminary LTS Scale = 1.1892734341 |
| Robust R Squared = 0.819730444 |
| Final LTS Scale = 0.8627851118 |
| LTS Residuals |
| N | Observed | Estimated | Residual | Res / S |
| 1 | 80.000000 | 79.398190 | 0.601810 | 0.697520 |
| 2 | 80.000000 | 75.653846 | 4.346154 | 5.037354 |
| 3 | 75.000000 | 75.653846 | -0.653846 | -0.757832 |
| 4 | 62.000000 | 68.914027 | -6.914027 | -8.013614 |
| 5 | 62.000000 | 61.425339 | 0.574661 | 0.666053 |
| 6 | 62.000000 | 61.425339 | 0.574661 | 0.666053 |
| 7 | 62.000000 | 62.174208 | -0.174208 | -0.201914 |
| 8 | 62.000000 | 62.923077 | -0.923077 | -1.069880 |
| 9 | 58.000000 | 59.178733 | -1.178733 | -1.366195 |
| Distribution of Residuals |
| MinRes | 1st Qu. | Median | Mean | 3rd Qu. | MaxRes |
| -6.914027149 | -1.050904977 | -0.174208145 | -0.416289593 | 0.5746606335 | 4.3461538462 |
| Weighted Least-Squares Estimation |
| RLS Parameter Estimates Based on LTS |
| Variable | Estimate | Approx Std Err | t Value | Pr > |t| | Lower WCI | Upper WCI |
| VAR1 | 0.76455907 | 0.03125286 | 24.46 | <.0001 | 0.70330458 | 0.82581356 |
| Intercep | 47.3985025 | 0.815574 | 58.12 | <.0001 | 45.8000068 | 48.9969982 |
| Weighted Sum of Squares = 3.3544093178 |
| RLS Scale Estimate = 0.819073784 |
| Cov Matrix of Parameter Estimates |
| | VAR1 | Intercep |
| VAR1 | 0.0009767415 | -0.02358133 |
| Intercep | -0.02358133 | 0.6651609492 |
| Weighted R-squared = 0.9917145853 |
| F(1,5) Statistic = 598.47009637 |
| Probability = 2.1279132E-6 |
| There are 7 points with nonzero weight. |
| Average Weight = 0.7777777778 |
| Weighted LS Residuals |
| N | Observed | Estimated | Residual | Res / S | Weight |
| 1 | 80.000000 | 79.509983 | 0.490017 | 0.598257 | 1.000000 |
| 2 | 80.000000 | 75.687188 | 4.312812 | 5.265474 | 0 |
| 3 | 75.000000 | 75.687188 | -0.687188 | -0.838982 | 1.000000 |
| 4 | 62.000000 | 68.806156 | -6.806156 | -8.309577 | 0 |
| 5 | 62.000000 | 61.160566 | 0.839434 | 1.024858 | 1.000000 |
| 6 | 62.000000 | 61.160566 | 0.839434 | 1.024858 | 1.000000 |
| 7 | 62.000000 | 61.925125 | 0.074875 | 0.091414 | 1.000000 |
| 8 | 62.000000 | 62.689684 | -0.689684 | -0.842029 | 1.000000 |
| 9 | 58.000000 | 58.866889 | -0.866889 | -1.058377 | 1.000000 |
| Distribution of Residuals |
| MinRes | 1st Qu. | Median | Mean | 3rd Qu. | MaxRes |
| -6.806156406 | -0.77828619 | 0.074875208 | -0.27703827 | 0.6647254576 | 4.31281198 |
| The run has been executed successfully. |
| sc is | 6 |
| | 36 |
| | 2 |
| | 7 |
| | 0.6235088 |
| | 1.1892734 |
| | 0.8627851 |
| | 0.8197304 |
| | 1.879 |
| | . |
| | 0.8190738 |
| | 3.3544093 |
| | 0.9917146 |
| | 598.4701 |
| | . |
| | . |
| | . |
| | . |
| | . |
| | . |
| coef is | 0.7488688 | 47.945701 |
| | 0.7645591 | 47.398502 |
| | 0.0312529 | 0.815574 |
| | 24.463648 | 58.11674 |
| | 2.1279E-6 | 2.8538E-8 |
| | 0.7033046 | 45.800007 |
| | 0.8258136 | 48.996998 |
| | . | . |
Copyright © 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.