Language Reference

LTS Call

performs robust regression

CALL LTS( sc, coef, wgt, opt, y<, <x><, sorb>>);

A robust (resistant) regression method, defined by minimizing the sum of the h smallest squared residuals.

The Least Trimmed Squares (LTS) subroutine performs robust regression (sometimes called resistant regression). It is able to detect outliers and perform a least squares regression on the remaining observations. Beginning with SAS/IML 8.1, the LTS subroutine implements a new algorithm, FAST-LTS, given by Rousseeuw and Van Driessen (1998). The new algorithm is set as the default. The algorithm in previous versions is temporarily available but will be phased out. See opt[9] for details.

You can specify the value of h, but for many applications the default value works well, and the results are usually quite stable across different choices of h.
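To make the objective concrete, the following SAS/IML sketch evaluates the sum of the h smallest squared residuals for a fixed parameter vector. The module name LtsObjective and the toy data are hypothetical and chosen only for illustration; the module is not part of SAS/IML, and it does not reproduce any scaling that the LTS subroutine might apply to the objective value it reports.

```sas
proc iml;
/* Hypothetical helper (not part of SAS/IML): sum of the h smallest
   squared residuals of y - x*b for a fixed parameter vector b */
start LtsObjective(y, x, b, h);
   r2 = (y - x*b)##2;          /* squared residuals, one per observation */
   call sort(r2, 1);           /* sort ascending by column 1 */
   return( sum(r2[1:h]) );     /* keep only the h smallest */
finish;

/* Toy data: three points on the line y = t and one gross outlier */
x = {1 1, 2 1, 3 1, 4 1};      /* regressor column and intercept column */
y = {1, 2, 3, 100};
b = {1, 0};                    /* slope 1, intercept 0 */
f = LtsObjective(y, x, b, 3);  /* h = 3 trims the outlier */
print f;
quit;
```

With h = 3 the outlier is trimmed and f is 0, whereas the untrimmed sum of squares (h = 4) would be 96^2 = 9216.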

In the following discussion, N is the number of observations and n is the number of regressors. The inputs to the LTS subroutine are as follows:



opt
refers to an options vector with the following components (missing values are treated as default values). The options vector can be a null vector.



opt[1]
specifies whether an intercept is used in the model (opt[1]=0) or not (opt[1] \neq 0). If opt[1]=0, then a column of ones is added as the last column to the input matrix x; that is, you do not need to add this column of ones yourself. The default is opt[1]=0.

opt[2]
specifies the amount of printed output. Higher values request additional output and include the output of lower values.



opt[2]=0
prints no output except error messages.

opt[2]=1
prints all output except (1) arrays of order O(N), such as weights, residuals, and diagnostics; (2) the history of the optimization process; and (3) subsets that result in singular linear systems.

opt[2]=2
additionally prints arrays of order O(N), such as weights, residuals, and diagnostics; also prints the case numbers of the observations in the best subset and some basic history of the optimization process.

opt[2]=3
additionally prints subsets that result in singular linear systems.

The default is opt[2]=0.

opt[3]
specifies whether only LTS is computed or whether, additionally, least squares (LS) and weighted least squares (WLS) regression are computed:



opt[3]=0
computes only LTS.

opt[3]=1
computes, in addition to LTS, weighted least squares regression on the observations with small LTS residuals (where small is defined by opt[8]).

opt[3]=2
computes, in addition to LTS, unweighted least squares regression.

opt[3]=3
adds both unweighted and weighted least squares regression to LTS regression.

The default is opt[3]=0.

opt[4]
specifies the quantile h to be minimized. This is used in the objective function. The default is opt[4]=h=\left[\frac{N+n+1}{2}\right], which corresponds to the highest possible breakdown value. This is also the default of the PROGRESS program. The value of h should be in the range \frac{N}{2}+1 \leq h \leq \frac{3N}{4} + \frac{n+1}{4}.
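As a worked check, take the stackloss example later in this entry, with 21 observations and 4 parameters (including the intercept), and assume that the square brackets denote the integer part:

```latex
h = \left[ \frac{21 + 4 + 1}{2} \right] = 13, \qquad
\frac{21}{2} + 1 = 11.5 \;\leq\; h \;\leq\; \frac{3 \cdot 21}{4} + \frac{4 + 1}{4} = 17
```

This agrees with the line "Minimizing Sum of 13 Smallest Squared Residuals" in the example output.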

opt[5]
specifies the number N_{\rm rep} of generated subsets. Each subset consists of n observations (k_1, ... ,k_n), where 1 \leq k_i \leq N. The total number of subsets consisting of n observations out of N observations is
N_{\rm tot} = {N \choose n} = \frac{\prod_{j=1}^n (N-j+1)}{\prod_{j=1}^n j}
where n is the number of parameters including the intercept.

Due to computer time restrictions, not all subset combinations of n observations out of N can be inspected for larger values of N and n. Specifying a value of N_{\rm rep} \lt N_{\rm tot} enables you to save computer time at the expense of computing a suboptimal solution.
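The subset count can be checked directly in SAS/IML. The sketch below uses the stackloss dimensions from the example later in this entry (21 observations, 4 parameters). The variable names nObs and nPar are illustrative; they are used because SAS names are not case sensitive, so a single-letter name could not distinguish the two quantities.

```sas
proc iml;
nObs = 21;                                /* number of observations */
nPar = 4;                                 /* parameters, including the intercept */
nTot = comb(nObs, nPar);                  /* Base SAS COMB function */
j = 1:nPar;
nTotProd = prod(nObs - j + 1) / prod(j);  /* product formula from the text */
print nTot nTotProd;                      /* both are 5985 */
quit;
```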

When opt[5] is zero or missing:

If N \gt 600, the default FAST-LTS algorithm constructs up to five disjoint random subsets with sizes as equal as possible, but not to exceed 300. Inside each subset, the algorithm chooses 500/5=100 subset combinations of n observations.

For the default FAST-LTS algorithm with N \lt 600 or the previous algorithm (before SAS/IML 8.1), the number of subsets is taken from the following table.

n                  1     2     3     4     5     6     7     8     9    10
N_{\rm lower}    500    50    22    17    15    14     0     0     0     0
N_{\rm upper}   10^6  1414   182    71    43    32    27    24    23    22
N_{\rm rep}      500  1000  1500  2000  2500  3000  3000  3000  3000  3000

n                 11    12    13    14    15
N_{\rm lower}      0     0     0     0     0
N_{\rm upper}     22    22    22    23    23
N_{\rm rep}     3000  3000  3000  3000  3000

If the number of cases (observations) N is smaller than N_{\rm lower}, then all possible subsets are used; otherwise, a fixed number of 500 subsets (for FAST-LTS) or N_{\rm rep} subsets (for the algorithm before SAS/IML 8.1) is chosen randomly. An exhaustive search is performed when opt[5]=-1. If N is larger than N_{\rm upper}, a note is printed in the log file indicating how many subsets exist.

opt[6]
is not used.

opt[7]
specifies whether the last argument sorb contains a given parameter vector b or a given subset for which the objective function should be evaluated.



opt[7]=0
sorb contains a given subset index.
opt[7]=1
sorb contains a given parameter vector b.

The default is opt[7]=0.

opt[8]
is relevant only for LS and WLS regression (opt[3] > 0). It specifies whether the covariance matrix of parameter estimates and approximate standard errors (ASEs) are computed and printed.



opt[8]=0
does not compute covariance matrix and ASEs.

opt[8]=1
computes covariance matrix and ASEs but prints neither of them.

opt[8]=2
computes the covariance matrix and ASEs but prints only the ASEs.

opt[8]=3
computes and prints both the covariance matrix and the ASEs.

The default is opt[8]=0.

opt[9]
is relevant only for LTS. If opt[9]=0, the algorithm FAST-LTS of Rousseeuw and Van Driessen (1998) is used. If opt[9] = 1, the algorithm of Rousseeuw and Leroy (1987) is used. The default is opt[9]=0.


y
refers to an N response vector y.

x
refers to an N x n matrix x of regressors. If opt[1] is zero or missing, an intercept x_{n+1} \equiv 1 is added by default as the last column of x. If the matrix x is not specified, y is analyzed as a univariate data set.

sorb
refers to an n vector containing either of the following:

  • n observation numbers of a subset for which the objective function should be evaluated; this subset can be the start for a pairwise exchange algorithm if opt[7] is specified.
  • n given parameters b=(b_1, ... ,b_n) (including the intercept, if necessary) for which the objective function should be evaluated.

Missing values are not permitted in x or y. Missing values in opt cause the default value to be used.

The LTS subroutine returns the following values:



sc
is a column vector containing the following scalar information, where rows 1 - 9 correspond to LTS regression and rows 11 - 14 correspond to either LS or WLS:



sc[1]
the quantile h used in the objective function

sc[2]
number of subsets generated

sc[3]
number of subsets with singular linear systems

sc[4]
number of nonzero weights w_i

sc[5]
lowest value of the objective function f_{\rm lts} attained

sc[6]
preliminary LTS scale estimate s_p

sc[7]
final LTS scale estimate s_f

sc[8]
robust r^2 (coefficient of determination)

sc[9]
asymptotic consistency factor

If opt[3] > 0, then the following are also set:



sc[11]
LS or WLS objective function (sum of squared residuals)

sc[12]
LS or WLS scale estimate

sc[13]
r^2 value for LS or WLS

sc[14]
f value for LS or WLS

For opt[3]=1 or opt[3]=3, these rows correspond to WLS estimates; for opt[3]=2, these rows correspond to LS estimates.

coef
is a matrix with n columns containing the following results in its rows:



coef[1,]
LTS parameter estimates

coef[2,]
indices of observations in the best subset

If opt[3] > 0, then the following are also set:



coef[3,]
LS or WLS parameter estimates

coef[4,]
approximate standard errors of LS or WLS estimates

coef[5,]
t-values

coef[6,]
p-values

coef[7,]
lower boundary of Wald confidence intervals

coef[8,]
upper boundary of Wald confidence intervals

For opt[3]=1 or opt[3]=3, these rows correspond to WLS estimates; for opt[3]=2, to LS estimates.

wgt
is a matrix with N columns containing the following results in its rows:



wgt[1,]
weights (=1 for small, =0 for large residuals)

wgt[2,]
residuals r_i = y_i - x_i b

wgt[3,]
resistant diagnostic u_i (note that the resistant diagnostic cannot be computed for a perfect fit when the objective function is zero or nearly zero)

Example

Consider Brownlee's (1965) stackloss data used in the example for the LMS subroutine.

For N=21 and n=4 (three explanatory variables plus the intercept), you obtain a total of 5,985 different subsets of 4 observations out of 21. If you do not specify optn[5], the FAST-LTS algorithm chooses 500 random sample subsets, as in the following code:

  
         /* X1  X2  X3   Y    Stackloss data */ 
  aa = { 1  80  27  89  42, 
         1  80  27  88  37, 
         1  75  25  90  37, 
         1  62  24  87  28, 
         1  62  22  87  18, 
         1  62  23  87  18, 
         1  62  24  93  19, 
         1  62  24  93  20, 
         1  58  23  87  15, 
         1  58  18  80  14, 
         1  58  18  89  14, 
         1  58  17  88  13, 
         1  58  18  82  11, 
         1  58  19  93  12, 
         1  50  18  89   8, 
         1  50  18  86   7, 
         1  50  19  72   8, 
         1  50  19  79   8, 
         1  50  20  80   9, 
         1  56  20  82  15, 
         1  70  20  91  15 };
 

  
   a = aa[,2:4]; b = aa[,5]; 
   optn = j(8,1,.); 
   optn[2]= 1;    /* ipri */ 
   optn[3]= 3;    /* ilsq */ 
   optn[8]= 3;    /* icov */ 
  
   CALL LTS(sc,coef,wgt,optn,b,a);
 

The preceding program produces the following output:

  
  
        Least Trimmed Squares (LTS) Method 
    Minimizing Sum of 13 Smallest Squared Residuals. 
      Highest Possible Breakdown Value =  42.86 % 
        Random Selection of 523 Subsets 
         Among 523 subsets 23 is/are singular.
 

  
   The best half of the entire data set obtained after full 
   iteration consists of the cases: 
  
   5        6        7        8        9       10       11 
   12       15       16       17       18      19 
  
  
          Estimated Coefficients 
  
       VAR1          VAR2              VAR3         Intercep 
  
  0.7409210642    0.3915267228    0.0111345398    -37.32332647 
  
  
           LTS Objective Function = 0.474940583 
           Preliminary LTS Scale = 0.9888435617 
           Robust R Squared = 0.973976868 
           Final LTS Scale = 1.0360272594
 

For LTS, observations 1, 2, 3, 4, 13, and 21 have scaled residuals larger than 2.5 (table not shown) and are considered outliers. Following are the corresponding WLS results:

  
  
             Weighted Least-Squares Estimation 
  
  
           RLS Parameter Estimates Based on LMS 
  
  
                             Approx             Pr > 
   Variable    Estimate     Std Err  t Value     |t| 
  
   VAR1        0.756940  0.078607     9.63     <.0001 
   VAR2        0.453530  0.136050     3.33     0.0067 
   VAR3        -0.05211  0.054637    -0.95     0.3607 
   Intercep    -34.0575  3.828818    -8.90     <.0001 
  
  
                    Lower WCI   Upper WCI 
  
                    0.602872    0.911008 
                    0.186876    0.720184 
                    -0.15919    0.054977 
                    -41.5618    -26.5531 
  
               Weighted Sum of Squares = 10.273044977 
                      Degrees of Freedom = 11 
                 RLS Scale Estimate = 0.9663918355
 

  
               Cov Matrix of Parameter Estimates 
  
               VAR1         VAR2         VAR3         Intercep 
  
    VAR1      0.0061791    -0.005776    -0.002300    -0.034290 
    VAR2      -0.005776    0.0185096    0.0002582    -0.069740 
    VAR3      -0.002300    0.0002582    0.0029852    -0.131487 
    Intercep  -0.034290    -0.069740    -0.131487    14.659852 
  
               Weighted R-squared = 0.9622869127 
                F(3,11) Statistic = 93.558645037 
                   Probability = 4.1136826E-8 
            There are 15 points with nonzero weight. 
                 Average Weight = 0.7142857143
 

See the entry for the LMS subroutine for details.
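After the call, the returned matrices can be read by row as described in the list of return values. The following sketch repeats the stackloss data so that it runs on its own. The names h, ltsEst, wlsEst, and outliers are illustrative, and the values mentioned in the comments are taken from the output listed above rather than from a fresh run; because subsets are chosen randomly, a new run is not guaranteed to reproduce them exactly.

```sas
proc iml;
/*     X1  X2  X3   Y    stackloss data, as in the example above */
aa = { 1  80  27  89  42,
       1  80  27  88  37,
       1  75  25  90  37,
       1  62  24  87  28,
       1  62  22  87  18,
       1  62  23  87  18,
       1  62  24  93  19,
       1  62  24  93  20,
       1  58  23  87  15,
       1  58  18  80  14,
       1  58  18  89  14,
       1  58  17  88  13,
       1  58  18  82  11,
       1  58  19  93  12,
       1  50  18  89   8,
       1  50  18  86   7,
       1  50  19  72   8,
       1  50  19  79   8,
       1  50  20  80   9,
       1  56  20  82  15,
       1  70  20  91  15 };

a = aa[,2:4];  b = aa[,5];
optn = j(8,1,.);
optn[2] = 0;    /* suppress printed output */
optn[3] = 3;    /* also compute LS and WLS */
call lts(sc, coef, wgt, optn, b, a);

h        = sc[1];             /* quantile h; 13 in the output above */
ltsEst   = coef[1,];          /* LTS estimates, intercept last */
wlsEst   = coef[3,];          /* WLS estimates, because optn[3]=3 */
outliers = loc(wgt[1,] = 0);  /* observation numbers with zero weight */
print h, ltsEst, wlsEst, outliers;
quit;
```

In the output above, the observations with zero weight are the six outliers named in the text (1, 2, 3, 4, 13, and 21), which leaves 15 points with nonzero weight.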
