Changes and Enhancements

LTS Call

performs robust regression

CALL LTS( sc, coef, wgt, opt, y<, <x><, sorb>>);

A new algorithm, FAST-LTS, was added to the LTS subroutine in SAS/IML Release 8.1. The FAST-LTS algorithm is set as the default algorithm. The original algorithm is kept for convenience, and can be used temporarily by specifying optn[9]=1. Eventually the original algorithm will be replaced with the new FAST-LTS algorithm.

The original algorithm for the LTS subroutine and the algorithm used in the LMS subroutine are based on the PROGRESS program by Rousseeuw and Leroy (1987). Rousseeuw and Hubert (1996) prepared a new version of PROGRESS to facilitate its inclusion in SAS software, and they have incorporated several recent developments. Among other things, the new version of PROGRESS now yields the exact LMS for simple regression, and the program uses a new definition of the robust coefficient of determination (R^2). Therefore, the outputs may differ slightly from those given in Rousseeuw and Leroy (1987) or those obtained from software based on the older version of PROGRESS.

FAST-LTS

Least trimmed squares (LTS) regression is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The LTS method was proposed by Rousseeuw (1984, p. 876) as a highly robust regression estimator, with breakdown value (n-h)/n. The computation time of the previous LTS algorithm grew too fast with the size of the data set, precluding its use for data mining. Rousseeuw and Van Driessen (1998) developed a new algorithm called FAST-LTS. The basic idea is an inequality involving order statistics and sums of squared residuals. Based on this inequality, Rousseeuw and Van Driessen (1998) develop techniques called "selective iteration" and "nested extensions." The new LTS algorithm implements these techniques to achieve faster computation. The intercept adjustment technique is also used in this new algorithm. For small data sets, FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than the previous LTS algorithm and is faster by orders of magnitude. The new algorithm is described briefly as follows; refer to Rousseeuw and Van Driessen (1998) for details:
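
As a rough illustration of the criterion described above, the following Python sketch evaluates a trimmed objective for a candidate line on the nine-point data set used in the illustrative example later in this section. The function name lts_objective is hypothetical. The scaling (square root of the mean of the h smallest squared residuals) is an assumption: the text does not state it, but it appears to reproduce the "LTS Objective Function" values printed in the example output.

```python
import math

# Data from the illustrative example in this section.
x = [42, 37, 37, 28, 18, 18, 19, 20, 15]
y = [80, 80, 75, 62, 62, 62, 62, 62, 58]


def lts_objective(x, y, slope, intercept, h):
    """Square root of the mean of the h smallest squared residuals.

    The scaling is an assumption chosen to match the printed example output.
    """
    squared = sorted((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sum(squared[:h]) / h)


h = 6  # coverage reported in the example output
# Coefficients printed by the previous algorithm and by FAST-LTS.
old_fit = lts_objective(x, y, 0.75, 47.916666667, h)
new_fit = lts_objective(x, y, 0.7488687783, 47.945701357, h)
print(round(old_fit, 6), round(new_fit, 6))
```

With the coefficients printed by the two algorithms, the sketch gives values close to 0.62361 and 0.62351, so the FAST-LTS fit has the slightly smaller objective on this data set.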

  1. The default h is [(n+p+1)/2], where p is the number of independent variables and [.] denotes the integer part. You can choose any integer h with [(n+p+1)/2] ≤ h ≤ n. The breakdown value of the LTS estimator, (n-h+1)/n, is reported. If you are sure that the data contain less than 25% contamination, you can obtain a good compromise between breakdown value and statistical efficiency by setting h=[.75n].
  2. If p=1 (univariate data), then compute the LTS estimator by the exact algorithm of Rousseeuw and Leroy (1987, pp. 171-172) and stop.
  3. From here on, p ≥ 2. If n<600, draw a random p-subset and compute the regression coefficients using these p points (if the regression is degenerate, draw another p-subset). Compute the absolute residuals for all points in the data set and select the h points with the smallest absolute residuals. Starting from this h-subset, carry out C-steps (concentration steps; refer to Rousseeuw and Van Driessen [1998] for details) until convergence. Repeat this procedure 500 or C(n,p) times (as determined by opt[5] of the LTS call), where C(n,p) is the number of p-subsets of the n cases, and find the ten (at most) solutions with the lowest sums of h squared residuals. For each of these ten best solutions, take C-steps until convergence and find the best final solution.
  4. If n > 600, construct up to five disjoint random subsets with sizes as equal as possible, but not to exceed 300. Inside each subset, repeat the procedure in step 3 500/5 = 100 times and keep the ten best solutions. Pool the subsets, yielding the merged set of size n_merged. In the merged set, for each of the 5×10 = 50 best solutions, carry out two C-steps using n_merged and h_merged = [n_merged (h/n)] and keep the ten best solutions. In the full data set, for each of these ten best solutions, take C-steps using n and h until convergence and find the best final solution.
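
The core of step 3 can be sketched in Python for the simple regression case (one regressor plus intercept). This is a minimal illustration under stated assumptions, not the SAS/IML implementation: all function names are hypothetical, every pair of cases is tried instead of random draws, and the intercept adjustment and the nested extensions of step 4 are omitted. The value p = 2 (slope and intercept) is taken from the example output, which reports 36 subsets of 2 cases out of 9.

```python
from itertools import combinations

# Data from the illustrative example in this section.
x = [42, 37, 37, 28, 18, 18, 19, 20, 15]
y = [80, 80, 75, 62, 62, 62, 62, 62, 58]


def default_h(n, p):
    """Integer part of (n + p + 1) / 2, the default coverage in step 1."""
    return (n + p + 1) // 2


def fit_line(x, y, idx):
    """Least squares (slope, intercept) for the cases in idx, or None if degenerate."""
    m = len(idx)
    mean_x = sum(x[i] for i in idx) / m
    mean_y = sum(y[i] for i in idx) / m
    sxx = sum((x[i] - mean_x) ** 2 for i in idx)
    if sxx == 0:
        return None  # all x values equal: no unique line
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in idx)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual(x, y, coef, i):
    slope, intercept = coef
    return y[i] - intercept - slope * x[i]


def h_smallest(x, y, coef, h):
    """Sorted indices of the h cases with the smallest absolute residuals."""
    order = sorted(range(len(x)), key=lambda i: (abs(residual(x, y, coef, i)), i))
    return sorted(order[:h])


def c_steps(x, y, coef, h, max_iter=100):
    """Alternate 'select the h best cases' and 'refit' until the subset is stable."""
    subset = h_smallest(x, y, coef, h)
    for _ in range(max_iter):
        refit = fit_line(x, y, subset)
        if refit is None:
            break
        coef = refit
        new_subset = h_smallest(x, y, coef, h)
        if new_subset == subset:
            break
        subset = new_subset
    return coef, subset


def trimmed_ss(x, y, coef, subset):
    """Sum of squared residuals over the selected cases."""
    return sum(residual(x, y, coef, i) ** 2 for i in subset)


def best_over_all_pairs(x, y, h):
    """Start from every non-degenerate pair and keep the lowest trimmed sum."""
    best = None
    for pair in combinations(range(len(x)), 2):
        start = fit_line(x, y, pair)
        if start is None:
            continue  # degenerate start, skip it
        coef, subset = c_steps(x, y, start, h)
        ss = trimmed_ss(x, y, coef, subset)
        if best is None or ss < best[0]:
            best = (ss, coef, subset)
    return best


h = default_h(len(x), 2)  # p = 2 here: slope and intercept
# Start from the line through cases 1 and 5 (zero-based indices 0 and 4).
coef_15, subset_15 = c_steps(x, y, fit_line(x, y, (0, 4)), h)
best_ss, best_coef, best_subset = best_over_all_pairs(x, y, h)
print(h, [i + 1 for i in subset_15], coef_15)
```

On the example data, h comes out as 6, and the start through cases 1 and 5 settles on cases 1 3 5 6 7 8 with coefficients near 0.7489 and 47.9457, which agrees with the FAST-LTS output shown in the illustrative example. Two of the 36 pairs share the same x value and are skipped as degenerate, consistent with the two singular subsets reported in that output.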

OPTION Changes

The previous LTS algorithm is used if optn[9] = 1. The FAST-LTS algorithm is the default and is also selected by optn[9] = 0.

OUTPUT Changes

Because of the change in algorithm, the output from the FAST-LTS algorithm is different:

  1. The "Complete Enumeration for LTS" table and the "Resistant Diagnostic" table do not apply for the FAST-LTS algorithm and are not displayed.
  2. The analysis based on "Observations of Best Subset" of size p, where p is the number of independent variables, is replaced by the analysis based on the best h observations of the entire data set obtained after full iteration, which gives the exact LTS estimator and covariances for small data sets (C(n,p) < 500, where C(n,p) is the number of p-subsets of the n cases). LTS Objective Function, Preliminary LTS Scale, Robust R Squared, and Final LTS Scale are also reported for the LTS estimator.
  3. The LTS residuals change because the LTS estimator changes.
  4. The "Coef" vector does not include the best subset.
See the following illustrative example for details.

Illustrative Example

The following example shows the difference between the two algorithms. The first output is generated by the previous algorithm (optn[9]=1), and the second output is generated by the FAST-LTS algorithm (optn[9]=0).

title1 'Compare two algorithms for LTS';

proc iml;
  reset noname;

  x = {42, 37, 37, 28, 18, 18, 19, 20, 15};
  y = {80, 80, 75, 62, 62, 62, 62, 62, 58};
  optn = j(9, 1, .);
  optn[1]= 0;    /* --- with intercept      --- */
  optn[2]= 4;    /* --- print all output    --- */
  optn[3]= 3;    /* --- compute LS and WLS  --- */
  optn[8]= 3;    /* --- covariance matrices --- */
  optn[9]= 1;    /* --- Version 7 LTS       --- */

  call lts(sc, coef, wgt, optn, y, x);

  print "sc is " sc, "coef is " coef;

  optn[9]= 0;    /* --- FAST-LTS algorithm   --- */

  call lts(sc, coef, wgt, optn, y, x);

  print "sc is " sc, "coef is " coef;

quit;

Comparison of the outputs is summarized as follows:

  1. The summary statistics for dependent and independent variables and the results of the classical (unweighted) least-squares estimation (output on page 1 and the first half of page 2) do not change.
  2. The "Complete Enumeration for LTS" table on page 1 and the "Resistant Diagnostic" table on page 3 of the first output are eliminated. The analysis based on "Observations of Best Subset" of size p, where p is the number of independent variables, is replaced by the analysis based on the best h observations of the entire data set obtained after full iteration.
  3. The best ten (at most) estimates before the final search are available by setting a proper option (see pages 2 and 3 of the second output). The first one is the best, with the smallest objective value.
  4. The results of the weighted least-squares estimation (page 4 of the first output, page 5 of the second output) do not change because the two algorithms detect the same outliers in this simple example. The results would change if the two algorithms detected different outliers.
  5. The "Coef" vector does not include the best subset (page 6 of the second output).
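
Item 4 can be checked with a short Python sketch of the reweighting step. It is only an illustration: the function name reweighted_ls is hypothetical, and the cutoff of 2.5 on the standardized residuals is an assumed value that the text does not state (on this data set any cutoff between roughly 1.4 and 5 gives the same weights). The coefficients and final LTS scales are copied from the two outputs below.

```python
# Data from the illustrative example in this section.
x = [42, 37, 37, 28, 18, 18, 19, 20, 15]
y = [80, 80, 75, 62, 62, 62, 62, 62, 58]


def reweighted_ls(x, y, slope, intercept, scale, cutoff=2.5):
    """Zero-one weights from standardized residuals, then least squares on kept cases.

    cutoff=2.5 is an assumed value, not taken from the text.
    """
    weights = [
        1.0 if abs((yi - intercept - slope * xi) / scale) <= cutoff else 0.0
        for xi, yi in zip(x, y)
    ]
    kept = [i for i, w in enumerate(weights) if w > 0]
    m = len(kept)
    mean_x = sum(x[i] for i in kept) / m
    mean_y = sum(y[i] for i in kept) / m
    sxx = sum((x[i] - mean_x) ** 2 for i in kept)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in kept)
    new_slope = sxy / sxx
    return weights, new_slope, mean_y - new_slope * mean_x


# LTS coefficients and final LTS scale printed by each algorithm.
w_old, slope_old, icpt_old = reweighted_ls(x, y, 0.75, 47.916666667, 0.8595864639)
w_new, slope_new, icpt_new = reweighted_ls(x, y, 0.7488687783, 47.945701357, 0.8627851118)
print(w_old == w_new, round(slope_new, 6), round(icpt_new, 6))
```

Both LTS fits give zero weight to cases 2 and 4 only, so the reweighted fit is the same in both runs, with a slope near 0.764559 and an intercept near 47.398503, in line with the "RLS Parameter Estimates Based on LTS" tables.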

Output from Previous Algorithm for LTS

 
Compare two algorithms for LTS

LTS: The sum of the 6 smallest squared residuals will be minimized.

Median and Mean
  Median Mean
VAR1 20 26
Intercep 1 1
Response 62 67
 
Dispersion and Standard Deviation
  Dispersion StdDev
VAR1 7.4130110925 10.22252415
Intercep 0 0
Response 7.3806276975 8.7177978871

Unweighted Least-Squares Estimation

 

LS Parameter Estimates
Variable   Estimate     Approx Std Err   t Value   Pr > |t|   Lower WCI    Upper WCI
VAR1       0.80502392   0.10637482       7.57      0.0001     0.59653312   1.01351473
Intercep   46.069378    2.94965086       15.62     <.0001     40.2881685   51.8505874

Sum of Squares = 66.218899522

Degrees of Freedom = 7

LS Scale Estimate = 3.0756857429

 

Cov Matrix of Parameter Estimates
  VAR1 Intercep
VAR1 0.0113156014 -0.294205637
Intercep -0.294205637 8.7004402045

R-squared = 0.8910873363

F(1,7) Statistic = 57.271681208

Probability = 0.0001297174


 
LS Residuals
N Observed Estimated Residual Res / S
1 80.000000 79.880383 0.119617 0.038891
2 80.000000 75.855263 4.144737 1.347581
3 75.000000 75.855263 -0.855263 -0.278072
4 62.000000 68.610048 -6.610048 -2.149130
5 62.000000 60.559809 1.440191 0.468251
6 62.000000 60.559809 1.440191 0.468251
7 62.000000 61.364833 0.635167 0.206512
8 62.000000 62.169856 -0.169856 -0.055226
9 58.000000 58.144737 -0.144737 -0.047058

Distribution of Residuals

 

MinRes 1st Qu. Median Mean 3rd Qu. MaxRes
-6.610047847 -0.512559809 0.1196172249 -2.36848E-15 1.0376794258 4.1447368421

There are 36 subsets of 2 cases out of 9 cases.

All 36 subsets will be considered.

Complete Enumeration for LTS

 

Subset   Singular   Best Criterion   Percent
10       1          0.084493         27
18       1          0.084493         50
28       2          0.084493         77
36       2          0.084493         100

Minimum Criterion= 0.0844927545

Least Trimmed Squares (LTS) Method

Minimizing Sum of 6 Smallest Squared Residuals.

Highest Possible Breakdown Value = 44.44 %

Selection of All 36 Subsets of 2 Cases Out of 9

Among 36 subsets 2 is/are singular.

 

Observations of Best Subset
1 5
 
Estimated Coefficients
VAR1 Intercep
0.75 47.916666667

LTS Objective Function = 0.6236095645

Preliminary LTS Scale = 1.189465671

Robust R Squared = 0.825

Final LTS Scale = 0.8595864639

 

LTS Residuals
N Observed Estimated Residual Res / S
1 80.000000 79.416667 0.583333 0.678621
2 80.000000 75.666667 4.333333 5.041184
3 75.000000 75.666667 -0.666667 -0.775567
4 62.000000 68.916667 -6.916667 -8.046505
5 62.000000 61.416667 0.583333 0.678621
6 62.000000 61.416667 0.583333 0.678621
7 62.000000 62.166667 -0.166667 -0.193892
8 62.000000 62.916667 -0.916667 -1.066404
9 58.000000 59.166667 -1.166667 -1.357242

Distribution of Residuals

 

MinRes 1st Qu. Median Mean 3rd Qu. MaxRes
-6.916666667 -1.041666667 -0.166666667 -0.416666667 0.5833333333 4.3333333333
 
Resistant Diagnostic
N   U           Resistant Diagnostic
1   12.521981   5.600000
2   12.521981   5.600000
3   9.167879    4.100000
4   13.339459   5.965588
5   1.709204    0.764379
6   1.709204    0.764379
7   0.642081    0.287147
8   1.697749    0.759257
9   2.236068    1.000000

Median(U)= 2.2360679775


Weighted Least-Squares Estimation

 

RLS Parameter Estimates Based on LTS
Variable   Estimate     Approx Std Err   t Value   Pr > |t|   Lower WCI    Upper WCI
VAR1       0.76455907   0.03125286       24.46     <.0001     0.70330458   0.82581356
Intercep   47.3985025   0.815574         58.12     <.0001     45.8000068   48.9969982

Weighted Sum of Squares = 3.3544093178

Degrees of Freedom = 5

RLS Scale Estimate = 0.819073784

 

Cov Matrix of Parameter Estimates
  VAR1 Intercep
VAR1 0.0009767415 -0.02358133
Intercep -0.02358133 0.6651609492

Weighted R-squared = 0.9917145853

F(1,5) Statistic = 598.47009637

Probability = 2.1279132E-6

There are 7 points with nonzero weight.

Average Weight = 0.7777777778

 

Weighted LS Residuals
N Observed Estimated Residual Res / S Weight
1 80.000000 79.509983 0.490017 0.598257 1.000000
2 80.000000 75.687188 4.312812 5.265474 0
3 75.000000 75.687188 -0.687188 -0.838982 1.000000
4 62.000000 68.806156 -6.806156 -8.309577 0
5 62.000000 61.160566 0.839434 1.024858 1.000000
6 62.000000 61.160566 0.839434 1.024858 1.000000
7 62.000000 61.925125 0.074875 0.091414 1.000000
8 62.000000 62.689684 -0.689684 -0.842029 1.000000
9 58.000000 58.866889 -0.866889 -1.058377 1.000000

Distribution of Residuals

 

MinRes 1st Qu. Median Mean 3rd Qu. MaxRes
-6.806156406 -0.77828619 0.074875208 -0.27703827 0.6647254576 4.31281198

The run has been executed successfully.

 

sc is 6
  36
  2
  7
  0.6236096
  1.1894657
  0.8595865
  0.825
  1.9073884
  .
  0.8190738
  3.3544093
  0.9917146
  598.4701
  .
  .
  .
  .
  .
  .
 
coef is 0.75 47.916667
  1 5
  0.7645591 47.398502
  0.0312529 0.815574
  24.463648 58.11674
  2.1279E-6 2.8538E-8
  0.7033046 45.800007
  0.8258136 48.996998

Output from FAST-LTS Algorithm

 
Compare two algorithms for LTS

LTS: The sum of the 6 smallest squared residuals will be minimized.

Median and Mean
  Median Mean
VAR1 20 26
Intercep 1 1
Response 62 67
 
Dispersion and Standard Deviation
  Dispersion StdDev
VAR1 7.4130110925 10.22252415
Intercep 0 0
Response 7.3806276975 8.7177978871

Unweighted Least-Squares Estimation

 

LS Parameter Estimates
Variable   Estimate     Approx Std Err   t Value   Pr > |t|   Lower WCI    Upper WCI
VAR1       0.80502392   0.10637482       7.57      0.0001     0.59653312   1.01351473
Intercep   46.069378    2.94965086       15.62     <.0001     40.2881685   51.8505874

Sum of Squares = 66.218899522

Degrees of Freedom = 7

LS Scale Estimate = 3.0756857429

 

Cov Matrix of Parameter Estimates
  VAR1 Intercep
VAR1 0.0113156014 -0.294205637
Intercep -0.294205637 8.7004402045

R-squared = 0.8910873363

F(1,7) Statistic = 57.271681208

Probability = 0.0001297174

 

LS Residuals
N Observed Estimated Residual Res / S
1 80.000000 79.880383 0.119617 0.038891
2 80.000000 75.855263 4.144737 1.347581
3 75.000000 75.855263 -0.855263 -0.278072
4 62.000000 68.610048 -6.610048 -2.149130
5 62.000000 60.559809 1.440191 0.468251
6 62.000000 60.559809 1.440191 0.468251
7 62.000000 61.364833 0.635167 0.206512
8 62.000000 62.169856 -0.169856 -0.055226
9 58.000000 58.144737 -0.144737 -0.047058

Distribution of Residuals

 

MinRes 1st Qu. Median Mean 3rd Qu. MaxRes
-6.610047847 -0.512559809 0.1196172249 -2.36848E-15 1.0376794258 4.1447368421

Least Trimmed Squares (LTS) Method

The (at most) 10 Best Estimates

Objective Value [1]: 0.0428203092

 

Estimated Coefficients
VAR1 Intercep
0.7521545304 0.1250675364

Objective Value [2]: 0.0454534796

 

Estimated Coefficients
VAR1 Intercep
0.7773132092 0.0679381212

Objective Value [3]: 0.0458503276

 

Estimated Coefficients
VAR1 Intercep
0.7857569344 0.0875053435

Objective Value [4]: 0.0470161862

 

Estimated Coefficients
VAR1 Intercep
0.7454450208 0.1033087106

Objective Value [5]: 0.0504563846

 

Estimated Coefficients
VAR1 Intercep
0.987904817 0.1606655488

Objective Value [6]: 0.1790484378

 

Estimated Coefficients
VAR1 Intercep
0.1926222834 -0.081665104

Objective Value [7]: 1.797693E308

Least Trimmed Squares (LTS) Method

Minimizing Sum of 6 Smallest Squared Residuals.

Highest Possible Breakdown Value = 44.44 %

Selection of All 36 Subsets of 2 Cases Out of 9

Among 36 subsets 2 is/are singular.

The best half of the entire data set obtained after full iteration consists of the cases:

 

1 3 5 6 7 8
 
Estimated Coefficients
VAR1 Intercep
0.7488687783 47.945701357

LTS Objective Function = 0.6235087791

Preliminary LTS Scale = 1.1892734341

Robust R Squared = 0.819730444

Final LTS Scale = 0.8627851118

 

LTS Residuals
N Observed Estimated Residual Res / S
1 80.000000 79.398190 0.601810 0.697520
2 80.000000 75.653846 4.346154 5.037354
3 75.000000 75.653846 -0.653846 -0.757832
4 62.000000 68.914027 -6.914027 -8.013614
5 62.000000 61.425339 0.574661 0.666053
6 62.000000 61.425339 0.574661 0.666053
7 62.000000 62.174208 -0.174208 -0.201914
8 62.000000 62.923077 -0.923077 -1.069880
9 58.000000 59.178733 -1.178733 -1.366195

Distribution of Residuals

 

MinRes 1st Qu. Median Mean 3rd Qu. MaxRes
-6.914027149 -1.050904977 -0.174208145 -0.416289593 0.5746606335 4.3461538462

Weighted Least-Squares Estimation

 

RLS Parameter Estimates Based on LTS
Variable   Estimate     Approx Std Err   t Value   Pr > |t|   Lower WCI    Upper WCI
VAR1       0.76455907   0.03125286       24.46     <.0001     0.70330458   0.82581356
Intercep   47.3985025   0.815574         58.12     <.0001     45.8000068   48.9969982

Weighted Sum of Squares = 3.3544093178

Degrees of Freedom = 5

RLS Scale Estimate = 0.819073784

 

Cov Matrix of Parameter Estimates
  VAR1 Intercep
VAR1 0.0009767415 -0.02358133
Intercep -0.02358133 0.6651609492

Weighted R-squared = 0.9917145853

F(1,5) Statistic = 598.47009637

Probability = 2.1279132E-6

There are 7 points with nonzero weight.

Average Weight = 0.7777777778

 

Weighted LS Residuals
N Observed Estimated Residual Res / S Weight
1 80.000000 79.509983 0.490017 0.598257 1.000000
2 80.000000 75.687188 4.312812 5.265474 0
3 75.000000 75.687188 -0.687188 -0.838982 1.000000
4 62.000000 68.806156 -6.806156 -8.309577 0
5 62.000000 61.160566 0.839434 1.024858 1.000000
6 62.000000 61.160566 0.839434 1.024858 1.000000
7 62.000000 61.925125 0.074875 0.091414 1.000000
8 62.000000 62.689684 -0.689684 -0.842029 1.000000
9 58.000000 58.866889 -0.866889 -1.058377 1.000000

Distribution of Residuals

 

MinRes 1st Qu. Median Mean 3rd Qu. MaxRes
-6.806156406 -0.77828619 0.074875208 -0.27703827 0.6647254576 4.31281198

The run has been executed successfully.

 

sc is 6
  36
  2
  7
  0.6235088
  1.1892734
  0.8627851
  0.8197304
  1.879
  .
  0.8190738
  3.3544093
  0.9917146
  598.4701
  .
  .
  .
  .
  .
  .
 
coef is 0.7488688 47.945701
  0.7645591 47.398502
  0.0312529 0.815574
  24.463648 58.11674
  2.1279E-6 2.8538E-8
  0.7033046 45.800007
  0.8258136 48.996998
  . .


Copyright © 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.