LTS Call
performs robust regression
- CALL LTS( sc, coef, wgt, opt, y <, x, sorb>);
A robust (resistant) regression method, defined by
minimizing the sum of the h smallest squared residuals.
The Least Trimmed Squares (LTS) subroutine performs robust
regression (sometimes called resistant regression).
It is able to detect outliers and perform a least squares regression on
the remaining observations. Beginning with SAS/IML 8.1, the LTS subroutine
implements a new algorithm, FAST-LTS, given by Rousseeuw and Van Driessen (1998).
The new algorithm is set as the default. The algorithm in previous versions
is temporarily available but will be phased out. See
opt[9] for details.
The value of h can be specified, but for many applications
the default value works well, and the results are
quite stable across different choices of h.
In the following discussion, N is the number of observations
and n is the number of regressors.
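The objective described above can be sketched in SAS/IML as follows. This is only an illustration of "the sum of the h smallest squared residuals" for one candidate coefficient vector; the module name LtsObjective and the toy data are hypothetical, and the value that the LTS subroutine itself reports may be scaled differently.

```sas
proc iml;
/* Hypothetical helper (not part of SAS/IML): evaluates the trimmed   */
/* sum of squares for a given coefficient vector b.                   */
start LtsObjective(y, X, b, h);
   r2 = (y - X*b)##2;          /* squared residuals, one per observation */
   call sort(r2, 1);           /* sort ascending by the single column    */
   return( sum(r2[1:h]) );     /* keep only the h smallest               */
finish;

/* Toy data: four points on the line y = x plus one gross outlier */
y = {1, 2, 3, 4, 100};
X = j(5, 1, 1) || {1, 2, 3, 4, 5};   /* intercept column and one regressor */
b = {0, 1};                          /* candidate fit: intercept 0, slope 1 */

obj = LtsObjective(y, X, b, 3);      /* the outlier is trimmed away */
print obj;
quit;
```

Because only the h smallest squared residuals enter the sum, the outlying fifth observation does not influence the value for this candidate fit.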
The inputs to the LTS subroutine are as follows:
- opt
- refers to an options vector with the following components
(missing values are treated as default values).
The options vector can be a null vector.
- opt[1]
- specifies whether an intercept is used in the model
(opt[1]=0) or not (opt[1]≠0).
If opt[1]=0, then a column of ones is added as
the last column to the input matrix X; that
is, you do not need to add this column of ones yourself.
The default is opt[1]=0.
- opt[2]
- specifies the amount of printed output.
Higher values request additional output
and include the output of lower values.
- opt[2]=0
- prints no output except error messages.
- opt[2]=1
- prints all output except (1) arrays of order N,
such as weights, residuals, and diagnostics; (2)
the history of the optimization process; and (3)
subsets that result in singular linear systems.
- opt[2]=2
- additionally prints arrays of order N, such
as weights, residuals, and diagnostics;
also prints the case numbers of the
observations in the best subset and some
basic history of the optimization process.
- opt[2]=3
- additionally prints subsets that
result in singular linear systems.
The default is opt[2]=0.
- opt[3]
- specifies whether only LTS is computed or
whether, additionally, least squares (LS) and
weighted least squares (WLS) regression are computed:
- opt[3]=0
- computes only LTS.
- opt[3]=1
- computes, in addition to LTS, weighted
least squares regression on the observations
with small LTS residuals (where
small is defined by opt[8]).
- opt[3]=2
- computes, in addition to LTS,
unweighted least squares regression.
- opt[3]=3
- adds both unweighted and weighted least squares
regression to LTS regression.
The default is opt[3]=0.
- opt[4]
- specifies the quantile h to be minimized.
This is used in the objective function.
The default is opt[4] = h = [(N+n+1)/2] (the integer part), which
corresponds to the highest possible breakdown value.
This is also the default of the PROGRESS program.
The value of h should be in the range
N/2 + 1 ≤ h ≤ 3N/4 + (n+1)/4.
- opt[5]
- specifies the number N_Rep of generated subsets.
Each subset consists of n observations
(k_1, ..., k_n), where 1 ≤ k_i ≤ N.
The total number of subsets consisting of n
observations out of N observations is
N_tot = N! / (n! (N-n)!),
where n is the number of
parameters including the intercept.
Due to computer time restrictions, not all subset
combinations of n observations out of N can
be inspected for larger values of N and n.
Specifying a value of N_Rep < N_tot
enables you to save computer time at the
expense of computing a suboptimal solution.
When opt[5] is zero or missing, the number of subsets depends on N:
If N > 600, the default FAST-LTS algorithm constructs
up to five disjoint random subsets with sizes as equal as
possible, but not to exceed 300. Inside each subset, the algorithm chooses
500/5 = 100 subset combinations of n observations.
For the default FAST-LTS algorithm with N ≤ 600 or the
previous algorithm (before SAS/IML 8.1), the number of subsets
is taken from the following table.
| n | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| N_lower | 500 | 50 | 22 | 17 | 15 | 14 | 0 | 0 | 0 | 0 |
| N_upper | 10^6 | 1414 | 182 | 71 | 43 | 32 | 27 | 24 | 23 | 22 |
| N_Rep | 500 | 1000 | 1500 | 2000 | 2500 | 3000 | 3000 | 3000 | 3000 | 3000 |

| n | 11 | 12 | 13 | 14 | 15 |
| N_lower | 0 | 0 | 0 | 0 | 0 |
| N_upper | 22 | 22 | 22 | 23 | 23 |
| N_Rep | 3000 | 3000 | 3000 | 3000 | 3000 |

If the number of cases (observations) N is smaller than
N_lower, then all possible subsets are used;
otherwise, a fixed 500 subsets for FAST-LTS, or N_Rep
subsets for the algorithm before SAS/IML 8.1, are chosen randomly.
This means that an exhaustive search
is performed for opt[5]=-1.
If N is larger than N_upper, a note is printed
in the log file indicating how many subsets exist.
- opt[6]
- is not used.
- opt[7]
- specifies whether the last argument sorb contains a
given parameter vector or a given subset
for which the objective function should be evaluated.
- opt[7]=0
- sorb contains a given subset index.
- opt[7]=1
- sorb contains a given parameter vector.
The default is opt[7]=0.
- opt[8]
- is relevant only for LS and WLS
regression (opt[3] > 0).
It specifies whether the covariance matrix of
parameter estimates and approximate standard
errors (ASEs) are computed and printed.
- opt[8]=0
- does not compute covariance matrix and ASEs.
- opt[8]=1
- computes covariance matrix and ASEs
but prints neither of them.
- opt[8]=2
- computes the covariance matrix and ASEs
but prints only the ASEs.
- opt[8]=3
- computes and prints both the covariance
matrix and the ASEs.
The default is opt[8]=0.
- opt[9]
- is relevant only for LTS. If opt[9]=0, the algorithm
FAST-LTS of Rousseeuw and Van Driessen (1998) is used.
If opt[9] = 1, the algorithm of Rousseeuw and Leroy (1987)
is used. The default is opt[9]=0.
- y
- refers to an N response vector y.
- x
- refers to an N x n matrix X of regressors.
If opt[1] is zero or missing, an intercept
is added by default as the last column of X.
If the matrix X is not specified,
y is analyzed as a univariate data set.
- sorb
- refers to an n vector containing either of the following:
- n observation numbers of a subset for which
the objective function should be evaluated;
this subset can be the start for a pairwise
exchange algorithm if opt[7] is specified.
- n given parameters
(including the intercept, if necessary) for
which the objective function should be evaluated.
Missing values are not permitted in x or y.
Missing values in opt cause the default value to be used.
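As a rough illustration of how the quantile and subset-count options relate to the data dimensions, the following sketch computes the default quantile and the total number of subsets for a data set with 21 observations and 4 parameters (the sizes of the stackloss example later in this entry), then fills an options vector. The names nObs, nPar, and nTot are hypothetical, and the formula for h assumes the default is the integer part of (N+n+1)/2, which agrees with the 13 residuals reported in the example output.

```sas
proc iml;
nObs = 21;                          /* N: number of observations            */
nPar = 4;                           /* n: 3 regressors plus the intercept   */

h    = floor((nObs + nPar + 1)/2);  /* assumed default quantile for opt[4]  */
nTot = comb(nObs, nPar);            /* number of distinct n-subsets of N    */
print h nTot;

opt    = j(9, 1, .);                /* missing entries request the defaults */
opt[4] = h;                         /* set the quantile explicitly          */
opt[5] = -1;                        /* request an exhaustive subset search  */
print opt;
quit;
```

SAS/IML variable names are not case sensitive, so N and n cannot be used as two separate variables; that is why the sketch uses nObs and nPar instead.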
The LTS subroutine returns the following values:
- sc
- is a column vector containing the following scalar information,
where rows 1 - 9 correspond to LTS regression and
rows 11 - 14 correspond to either LS or WLS:
- sc[1]
- the quantile used in the objective function
- sc[2]
- number of subsets generated
- sc[3]
- number of subsets with singular linear systems
- sc[4]
- number of nonzero weights
- sc[5]
- lowest value of the objective function
attained
- sc[6]
- preliminary LTS scale estimate
- sc[7]
- final LTS scale estimate
- sc[8]
- robust R squared (coefficient of determination)
- sc[9]
- asymptotic consistency factor
If opt[3] > 0, then the following are also set:
- sc[11]
- LS or WLS objective function (sum of squared residuals)
- sc[12]
- LS or WLS scale estimate
- sc[13]
- R-squared value for LS or WLS
- sc[14]
- F value for LS or WLS
For opt[3]=1 or opt[3]=3, these rows
correspond to WLS estimates; for opt[3]=2,
these rows correspond to LS estimates.
- coef
- is a matrix with n columns containing
the following results in its rows:
- coef[1,]
- LTS parameter estimates
- coef[2,]
- indices of observations in the best subset
If opt[3] > 0, then the following are also set:
- coef[3,]
- LS or WLS parameter estimates
- coef[4,]
- approximate standard errors of LS or WLS estimates
- coef[5,]
- t-values
- coef[6,]
- p-values
- coef[7,]
- lower boundary of Wald confidence intervals
- coef[8,]
- upper boundary of Wald confidence intervals
For opt[3]=1 or opt[3]=3, these rows correspond
to WLS estimates; for opt[3]=2, to LS estimates.
- wgt
- is a matrix with N columns containing
the following results in its rows:
- wgt[1,]
- weights (=1 for small, =0 for large residuals)
- wgt[2,]
- residuals
- wgt[3,]
- resistant diagnostic (note that the resistant
diagnostic cannot be computed for a perfect fit
when the objective function is zero or nearly zero)
Example
Consider Brownlee's (1965) stackloss
data used in the example for the LMS subroutine.
For N = 21 and n = 4 (three explanatory variables
plus the intercept), you obtain a total of 5,985
different subsets of 4 observations out of 21.
If you decide not to specify optn[5],
the FAST-LTS algorithm chooses 500 random sample subsets,
as in the following code:
/* X1 X2 X3 Y Stackloss data */
aa = { 1 80 27 89 42,
1 80 27 88 37,
1 75 25 90 37,
1 62 24 87 28,
1 62 22 87 18,
1 62 23 87 18,
1 62 24 93 19,
1 62 24 93 20,
1 58 23 87 15,
1 58 18 80 14,
1 58 18 89 14,
1 58 17 88 13,
1 58 18 82 11,
1 58 19 93 12,
1 50 18 89 8,
1 50 18 86 7,
1 50 19 72 8,
1 50 19 79 8,
1 50 20 80 9,
1 56 20 82 15,
1 70 20 91 15 };
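a = aa[,2:4];         /* regressors X1-X3; LTS adds the intercept column itself */
b = aa[,5];           /* response Y */
optn = j(8,1,.);      /* missing entries request the default values */
optn[2]= 1;           /* ipri: print summary output, no per-observation arrays */
optn[3]= 3;           /* ilsq: also compute LS and WLS regression */
optn[8]= 3;           /* icov: compute and print covariance matrix and ASEs */
CALL LTS(sc,coef,wgt,optn,b,a);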
The preceding program produces the following output:
Least Trimmed Squares (LTS) Method
Minimizing Sum of 13 Smallest Squared Residuals.
Highest Possible Breakdown Value = 42.86 %
Random Selection of 523 Subsets
Among 523 subsets 23 is/are singular.
The best half of the entire data set obtained after full
iteration consists of the cases:
5 6 7 8 9 10 11
12 15 16 17 18 19
Estimated Coefficients
VAR1 VAR2 VAR3 Intercep
0.7409210642 0.3915267228 0.0111345398 -37.32332647
LTS Objective Function = 0.474940583
Preliminary LTS Scale = 0.9888435617
Robust R Squared = 0.973976868
Final LTS Scale = 1.0360272594
For LTS, observations 1, 2, 3, 4, 13, and 21 have scaled residuals
larger than 2.5 (table not shown) and are considered outliers.
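One possible way to recover this information from the returned matrices, rather than from the printed tables, is sketched below. It assumes the statements are appended to the same PROC IML session as the stackloss program, so that sc, coef, and wgt already exist; the names ltsEst, finalScale, and zeroWgt are hypothetical.

```sas
/* Continues the PROC IML session of the stackloss program above:     */
/* sc, coef, and wgt are the matrices returned by CALL LTS.           */
ltsEst     = coef[1, ];            /* row 1: LTS parameter estimates   */
finalScale = sc[7];                /* final LTS scale estimate         */
zeroWgt    = loc(wgt[1, ] = 0);    /* observations whose weight is 0   */

print ltsEst, finalScale;
if ncol(zeroWgt) > 0 then print zeroWgt;   /* LOC returns an empty matrix if none */
```

The observation numbers in zeroWgt would be expected to correspond to the cases flagged as outliers, since the weights are 0 for large residuals.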
Following are the corresponding WLS results:
Weighted Least-Squares Estimation
RLS Parameter Estimates Based on LMS
Approx Pr >
Variable Estimate Std Err t Value |t|
VAR1 0.756940 0.078607 9.63 <.0001
VAR2 0.453530 0.136050 3.33 0.0067
VAR3 -0.05211 0.054637 -0.95 0.3607
Intercep -34.0575 3.828818 -8.90 <.0001
Lower WCI Upper WCI
0.602872 0.911008
0.186876 0.720184
-0.15919 0.054977
-41.5618 -26.5531
Weighted Sum of Squares = 10.273044977
Degrees of Freedom = 11
RLS Scale Estimate = 0.9663918355
Cov Matrix of Parameter Estimates
VAR1 VAR2 VAR3 Intercep
VAR1 0.0061791 -0.005776 -0.002300 -0.034290
VAR2 -0.005776 0.0185096 0.0002582 -0.069740
VAR3 -0.002300 0.0002582 0.0029852 -0.131487
Intercep -0.034290 -0.069740 -0.131487 14.659852
Weighted R-squared = 0.9622869127
F(3,11) Statistic = 93.558645037
Probability = 4.1136826E-8
There are 15 points with nonzero weight.
Average Weight = 0.7142857143
See the entry for the LMS subroutine for details.