Language Reference: LTS Call :: SAS/IML(R) 9.2 User's Guide

LTS Call

performs robust regression

CALL LTS( sc, coef, wgt, opt, y <, <x> <, sorb>> );

A robust (resistant) regression method, defined by minimizing the sum of the $h$ smallest squared residuals.
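As a sketch of this definition in symbols: the notation $r_i(b)$ for the residual of observation $i$ under a parameter vector $b$, and $r^2_{(i)}(b)$ for the ordered squared residuals, is introduced here for illustration and is not taken from this page. Here $h$ is the number of squared residuals that enter the sum and $N$ is the number of observations.

$\hat{b}_{\rm LTS} = \arg\min_{b} \sum_{i=1}^{h} r^2_{(i)}(b), \qquad r^2_{(1)}(b) \leq r^2_{(2)}(b) \leq \cdots \leq r^2_{(N)}(b)$

Because only the $h$ smallest squared residuals enter the sum, the $N-h$ observations with the largest residuals do not influence the fit.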

The Least Trimmed Squares (LTS) subroutine performs robust regression (sometimes called resistant regression). It is able to detect outliers and perform a least squares regression on the remaining observations. Beginning with SAS/IML 8.1, the LTS subroutine implements a new algorithm, FAST-LTS, given by Rousseeuw and Van Driessen (1998). The new algorithm is set as the default. The algorithm in previous versions is temporarily available but will be phased out. See opt[9] for details.

The value of $h$ can be specified, but for many applications the default value works well, and the results seem to be quite stable across different choices of $h$.

In the following discussion, $N$ is the number of observations and $n$ is the number of regressors. The inputs to the LTS subroutine are as follows:

opt

refers to an options vector with the following components (missing values are treated as default values). The options vector can be a null vector.

opt[1]

specifies whether an intercept is used in the model (opt[1]=0) or not (opt[1] $\neq 0$). If opt[1]=0, then a column of ones is added as the last column to the input matrix $X$; that is, you do not need to add this column of ones yourself. The default is opt[1]=0.

opt[2]

specifies the amount of printed output. Higher values request additional output and include the output of lower values.

opt[2]=0: prints no output except error messages.
opt[2]=1: prints all output except (1) arrays of order $N$, such as weights, residuals, and diagnostics; (2) the history of the optimization process; and (3) subsets that result in singular linear systems.
opt[2]=2: additionally prints arrays of order $N$, such as weights, residuals, and diagnostics; also prints the case numbers of the observations in the best subset and some basic history of the optimization process.
opt[2]=3: additionally prints subsets that result in singular linear systems.

The default is opt[2]=0.

opt[3]

specifies whether only LTS is computed or whether, additionally, least squares (LS) and weighted least squares (WLS) regression are computed:

opt[3]=0: computes only LTS.
opt[3]=1: computes, in addition to LTS, weighted least squares regression on the observations with small LTS residuals (where small is defined by opt[8]).
opt[3]=2: computes, in addition to LTS, unweighted least squares regression.
opt[3]=3: adds both unweighted and weighted least squares regression to LTS regression.

The default is opt[3]=0.

opt[4]

specifies the quantile $h$ to be minimized. This is used in the objective function. The default is opt[4] $=h=[\frac{N+n+1}{2}]$, which corresponds to the highest possible breakdown value. This is also the default of the PROGRESS program. The value of $h$ should be in the range $\frac{N}{2}+1 \leq h \leq \frac{3N}{4} + \frac{n+1}{4}$.
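The default quantile and the bounds of its range can be checked with a few lines of SAS/IML. The following is a minimal sketch, not part of the LTS subroutine itself. The variable names nObs and nPar are hypothetical stand-ins for the number of observations and the number of parameters (SAS names are not case sensitive, so two names that differ only in case cannot be used), the values 21 and 4 are assumed for illustration, and the square brackets in the default formula are taken to mean the integer part, which is an assumption.

```sas
proc iml;
   nObs = 21;   /* number of observations (assumed value) */
   nPar = 4;    /* number of parameters, including the intercept (assumed value) */

   /* default quantile; FLOOR implements the integer part (assumption) */
   h = floor((nObs + nPar + 1) / 2);

   /* lower and upper bounds of the range given for h */
   hLow  = nObs/2 + 1;
   hHigh = 3*nObs/4 + (nPar + 1)/4;

   print h hLow hHigh;
quit;
```

For these values, h is 13 and the range runs from 11.5 to 17. The value 13 agrees with the line "Minimizing Sum of 13 Smallest Squared Residuals" in the example output at the end of this entry.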

opt[5]

specifies the number $n_{\rm rep}$ of generated subsets. Each subset consists of $n$ observations $(k_1, \ldots, k_n)$, where $1 \leq k_i \leq N$. The total number of subsets consisting of $n$ observations out of $N$ observations is

$n_{\rm tot} = {N \choose n} = \frac{\prod_{j=1}^n (N-j+1)} {\prod_{j=1}^n j}$

where $n$ is the number of parameters including the intercept.

Due to computer time restrictions, not all subset combinations of $n$ observations out of $N$ can be inspected for larger values of $N$ and $n$. Specifying a value of $n_{\rm rep} \lt n_{\rm tot}$ enables you to save computer time at the expense of computing a suboptimal solution.
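As a worked instance of the subset count, take the stackloss data from the Example section, with 21 observations and 4 parameters including the intercept:

$n_{\rm tot} = {21 \choose 4} = \frac{21 \cdot 20 \cdot 19 \cdot 18}{4 \cdot 3 \cdot 2 \cdot 1} = \frac{143640}{24} = 5985$

This is the figure of 5,985 subsets quoted in the Example section, so a random selection of about 500 subsets inspects well under a tenth of them.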

When opt[5] is zero or missing:

If $N \gt 600$, the default FAST-LTS algorithm constructs up to five disjoint random subsets with sizes as equal as possible, but not to exceed 300. Inside each subset, the algorithm chooses subset combinations of $n$ observations.

For the default FAST-LTS algorithm with $N \lt 600$ or the previous algorithm (before SAS/IML 8.1), the number of subsets is taken from the following table.

n	1	2	3	4	5	6	7	8	9	10
$n_{\rm lower}$	500	50	22	17	15	14	0	0	0	0
$n_{\rm upper}$		1414	182	71	43	32	27	24	23	22
$n_{\rm rep}$	500	1000	1500	2000	2500	3000	3000	3000	3000	3000

n	11	12	13	14	15
$n_{\rm lower}$	0	0	0	0	0
$n_{\rm upper}$	22	22	22	23	23
$n_{\rm rep}$	3000	3000	3000	3000	3000

If the number of cases (observations) $N$ is smaller than $n_{\rm lower}$, then all possible subsets are used; otherwise, a fixed number of 500 subsets for FAST-LTS, or $n_{\rm rep}$ subsets for the algorithm before SAS/IML 8.1, are chosen randomly. This means that an exhaustive search is performed for opt[5]=-1. If $N$ is larger than $n_{\rm upper}$, a note is printed in the log file indicating how many subsets exist.
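The following sketch shows how an exhaustive search might be requested through opt[5]=-1, as described above. The data are hypothetical (ten observations, one regressor, the last observation made deliberately discordant) and are not taken from this page; no output is shown because it has not been reproduced here.

```sas
proc iml;
   /* hypothetical data: one regressor, ten observations */
   x = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
   y = {3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 40.0};

   optn = j(9, 1, .);   /* missing values select the defaults */
   optn[2] = 1;         /* print the basic output */
   optn[5] = -1;        /* request an exhaustive search over all subsets */

   call lts(sc, coef, wgt, optn, y, x);

   /* sc[2] holds the number of subsets generated */
   print (sc[2])[label="Number of subsets generated"];
quit;
```

An exhaustive search is practical only for small problems, because the number of subsets grows quickly with the number of observations and parameters.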

opt[6]

is not used.

opt[7]

specifies whether the last argument sorb contains a given parameter vector $b$ or a given subset for which the objective function should be evaluated.

opt[7]=0: sorb contains a given subset index.
opt[7]=1: sorb contains a given parameter vector $b$.

The default is opt[7]=0.

opt[8]

is relevant only for LS and WLS regression (opt[3] > 0). It specifies whether the covariance matrix of parameter estimates and approximate standard errors (ASEs) are computed and printed.

opt[8]=0: does not compute covariance matrix and ASEs.
opt[8]=1: computes covariance matrix and ASEs but prints neither of them.
opt[8]=2: computes the covariance matrix and ASEs but prints only the ASEs.
opt[8]=3: computes and prints both the covariance matrix and the ASEs.

The default is opt[8]=0.

opt[9]

is relevant only for LTS. If opt[9]=0, the algorithm FAST-LTS of Rousseeuw and Van Driessen (1998) is used. If opt[9] = 1, the algorithm of Rousseeuw and Leroy (1987) is used. The default is opt[9]=0.
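As a sketch of how the options vector can be assembled, the following fragment creates a nine-element vector of missing values and then overrides selected components. The particular settings are chosen only for illustration. The vector needs nine elements here because optn[9] is assigned.

```sas
proc iml;
   optn = j(9, 1, .);   /* nine components; a missing value selects the default */

   optn[1] = 0;   /* add an intercept column to the regressors */
   optn[2] = 2;   /* also print weights, residuals, and diagnostics */
   optn[3] = 3;   /* compute LS and WLS in addition to LTS */
   optn[8] = 3;   /* compute and print the covariance matrix and ASEs */
   optn[9] = 0;   /* use the FAST-LTS algorithm */

   print optn;    /* components 4 through 7 remain missing (defaults) */
quit;
```

Leaving a component missing, rather than hard-coding its default, keeps the call tied to whatever default the subroutine documents.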

y

refers to an $N$ response vector $y$.

x

refers to an $N \times n$ matrix $X$ of regressors. If opt[1] is zero or missing, an intercept $x_{n+1} \equiv 1$ is added by default as the last column of $X$. If the matrix $X$ is not specified, $y$ is analyzed as a univariate data set.

sorb

refers to an $n$ vector containing either of the following:

observation numbers of a subset for which the objective function should be evaluated; this subset can be the start for a pairwise exchange algorithm if opt[7] is specified.
given parameters (including the intercept, if necessary) for which the objective function should be evaluated.

Missing values are not permitted in $x$ or $y$. Missing values in opt cause the default value to be used.

The LTS subroutine returns the following values:

sc

is a column vector containing the following scalar information, where rows 1 - 9 correspond to LTS regression and rows 11 - 14 correspond to either LS or WLS:

sc[1]: the quantile used in the objective function
sc[2]: number of subsets generated
sc[3]: number of subsets with singular linear systems
sc[4]: number of nonzero weights
sc[5]: lowest value of the objective function $f_{\rm lts}$ attained
sc[6]: preliminary LTS scale estimate
sc[7]: final LTS scale estimate
sc[8]: robust $R^2$ (coefficient of determination)
sc[9]: asymptotic consistency factor

If opt[3] > 0, then the following are also set:

sc[11]: LS or WLS objective function (sum of squared residuals)
sc[12]: LS or WLS scale estimate
sc[13]: $R^2$ value for LS or WLS
sc[14]: $F$ value for LS or WLS

For opt[3]=1 or opt[3]=3, these rows correspond to WLS estimates; for opt[3]=2, these rows correspond to LS estimates.

coef

is a matrix with $n$ columns containing the following results in its rows:

coef[1,]: LTS parameter estimates
coef[2,]: indices of observations in the best subset

If opt[3] > 0, then the following are also set:

coef[3,]: LS or WLS parameter estimates
coef[4,]: approximate standard errors of LS or WLS estimates
coef[5,]: $t$-values
coef[6,]: $p$-values
coef[7,]: lower boundary of Wald confidence intervals
coef[8,]: upper boundary of Wald confidence intervals

For opt[3]=1 or opt[3]=3, these rows correspond to WLS estimates; for opt[3]=2, to LS estimates.

wgt

is a matrix with $N$ columns containing the following results in its rows:

wgt[1,]: weights (=1 for small, =0 for large residuals)
wgt[2,]: residuals
wgt[3,]: resistant diagnostic (note that the resistant diagnostic cannot be computed for a perfect fit when the objective function is zero or nearly zero)
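The returned matrices can be indexed directly to pull out individual results. The following sketch, which uses hypothetical data and hypothetical variable names (nKept, bLTS, bWLS, flagged), reads a few entries of sc and coef and lists the observations that received zero weight. Output is not shown because it has not been reproduced here.

```sas
proc iml;
   /* hypothetical data: one regressor, ten observations */
   x = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
   y = {3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 40.0};

   optn = j(9, 1, .);   /* missing values select the defaults */
   optn[3] = 3;         /* also compute LS and WLS so that coef[3,] is set */

   call lts(sc, coef, wgt, optn, y, x);

   h     = sc[1];        /* quantile used in the objective function */
   nKept = sc[4];        /* number of nonzero weights */
   bLTS  = coef[1, ];    /* LTS parameter estimates */
   bWLS  = coef[3, ];    /* WLS parameter estimates (because opt[3]=3) */

   /* observation numbers whose weight in wgt[1,] is zero */
   flagged = loc(wgt[1, ] = 0);

   print h nKept, bLTS, bWLS;
   if ncol(flagged) > 0 then print flagged;
quit;
```

The check on ncol(flagged) guards against the case in which no observation has zero weight, where LOC returns an empty matrix.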

Example

Consider Brownlee's (1965) stackloss data used in the example for the LMS subroutine.

For $N=21$ and $n=4$ (three explanatory variables plus an intercept), you obtain a total of 5,985 different subsets of 4 observations out of 21. If you decide not to specify optn[5], the FAST-LTS algorithm chooses 500 random sample subsets, as in the following code:

  
         /* X1  X2  X3   Y    Stackloss data */ 
  aa = { 1  80  27  89  42, 
         1  80  27  88  37, 
         1  75  25  90  37, 
         1  62  24  87  28, 
         1  62  22  87  18, 
         1  62  23  87  18, 
         1  62  24  93  19, 
         1  62  24  93  20, 
         1  58  23  87  15, 
         1  58  18  80  14, 
         1  58  18  89  14, 
         1  58  17  88  13, 
         1  58  18  82  11, 
         1  58  19  93  12, 
         1  50  18  89   8, 
         1  50  18  86   7, 
         1  50  19  72   8, 
         1  50  19  79   8, 
         1  50  20  80   9, 
         1  56  20  82  15, 
         1  70  20  91  15 };

  
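   a = aa[,2:4]; b = aa[,5];   /* regressors X1-X3 and response Y */ 
   optn = j(8,1,.);            /* missing values select the defaults */ 
   optn[2]= 1;    /* ipri: amount of printed output */ 
   optn[3]= 3;    /* ilsq: also compute LS and WLS */ 
   optn[8]= 3;    /* icov: compute and print covariance matrix and ASEs */ 
  
   CALL LTS(sc,coef,wgt,optn,b,a);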

The preceding program produces the following output:

  
  
        Least Trimmed Squares (LTS) Method 
    Minimizing Sum of 13 Smallest Squared Residuals. 
      Highest Possible Breakdown Value =  42.86 % 
        Random Selection of 523 Subsets 
         Among 523 subsets 23 is/are singular.

  
   The best half of the entire data set obtained after full 
   iteration consists of the cases: 
  
   5        6        7        8        9       10       11 
   12       15       16       17       18      19 
  
  
          Estimated Coefficients 
  
       VAR1          VAR2              VAR3         Intercep 
  
  0.7409210642    0.3915267228    0.0111345398    -37.32332647 
  
  
           LTS Objective Function = 0.474940583 
           Preliminary LTS Scale = 0.9888435617 
           Robust R Squared = 0.973976868 
           Final LTS Scale = 1.0360272594

For LTS, observations 1, 2, 3, 4, 13, and 21 have scaled residuals larger than 2.5 (table not shown) and are considered outliers. The corresponding WLS results follow:

  
  
             Weighted Least-Squares Estimation 
  
  
           RLS Parameter Estimates Based on LMS 
  
  
                             Approx             Pr > 
   Variable    Estimate     Std Err  t Value     |t| 
  
   VAR1        0.756940  0.078607     9.63     <.0001 
   VAR2        0.453530  0.136050     3.33     0.0067 
   VAR3        -0.05211  0.054637    -0.95     0.3607 
   Intercep    -34.0575  3.828818    -8.90     <.0001 
  
  
                    Lower WCI   Upper WCI 
  
                    0.602872    0.911008 
                    0.186876    0.720184 
                    -0.15919    0.054977 
                    -41.5618    -26.5531 
  
               Weighted Sum of Squares = 10.273044977 
                      Degrees of Freedom = 11 
                 RLS Scale Estimate = 0.9663918355

  
               Cov Matrix of Parameter Estimates 
  
               VAR1         VAR2         VAR3         Intercep 
  
    VAR1      0.0061791    -0.005776    -0.002300    -0.034290 
    VAR2      -0.005776    0.0185096    0.0002582    -0.069740 
    VAR3      -0.002300    0.0002582    0.0029852    -0.131487 
    Intercep  -0.034290    -0.069740    -0.131487    14.659852 
  
               Weighted R-squared = 0.9622869127 
                F(3,11) Statistic = 93.558645037 
                   Probability = 4.1136826E-8 
            There are 15 points with nonzero weight. 
                 Average Weight = 0.7142857143

See the entry for the LMS subroutine for details.
