Robust Regression Examples
The following example presents comparisons of LMS, V7 LTS, and FAST-LTS. The data analyzed are the stackloss data of Brownlee (1965), which are also used for documenting the L1 regression module. The three explanatory variables correspond to measurements for a plant oxidizing ammonia to nitric acid on 21 consecutive days:
- air flow to the plant
- cooling water inlet temperature
- acid concentration
The response variable gives the permillage of ammonia lost (stackloss). The following data are also given in Rousseeuw and Leroy (1987) and Osborne (1985):
print "Stackloss Data"; aa = { 1 80 27 89 42, 1 80 27 88 37, 1 75 25 90 37, 1 62 24 87 28, 1 62 22 87 18, 1 62 23 87 18, 1 62 24 93 19, 1 62 24 93 20, 1 58 23 87 15, 1 58 18 80 14, 1 58 18 89 14, 1 58 17 88 13, 1 58 18 82 11, 1 58 19 93 12, 1 50 18 89 8, 1 50 18 86 7, 1 50 19 72 8, 1 50 19 79 8, 1 50 20 80 9, 1 56 20 82 15, 1 70 20 91 15 };
Rousseeuw and Leroy (1987) cite a large number of papers in which the preceding data set was analyzed. They state that most researchers "concluded that observations 1, 3, 4, and 21 were outliers" and that some people also reported observation 2 as an outlier.
For N = 21 and n = 4 (three explanatory variables including intercept), you obtain a total of 5,985 different subsets of 4 observations out of 21. If you do not specify OPTN[5], the LMS algorithm draws random sample subsets. Because many of these subsets lead to singular linear systems that you do not want to print, you can choose OPTN[2]=2 for reduced printed output, as in the following:
title2 "***Use 2000 Random Subsets for LMS***"; a = aa[,2:4]; b = aa[,5]; optn = j(9,1,.); optn[2]= 2; /* ipri */ optn[3]= 3; /* ilsq */ optn[8]= 3; /* icov */ call lms(sc,coef,wgt,optn,b,a);
Summary statistics are shown in Output 12.2.1. Output 12.2.2 displays the results of LS regression. Output 12.2.3 displays the LMS results for the 2000 random subsets.
**LS Parameter Estimates**

| Variable | Estimate | Approx Std Err | t Value | Pr > \|t\| | Lower WCI | Upper WCI |
|---|---|---|---|---|---|---|
| VAR1 | 0.7156402 | 0.13485819 | 5.31 | <.0001 | 0.45132301 | 0.97995739 |
| VAR2 | 1.29528612 | 0.36802427 | 3.52 | 0.0026 | 0.57397182 | 2.01660043 |
| VAR3 | -0.1521225 | 0.15629404 | -0.97 | 0.3440 | -0.4584532 | 0.15420818 |
| Intercep | -39.919674 | 11.8959969 | -3.36 | 0.0038 | -63.2354 | -16.603949 |
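Because the estimates in the preceding table come from an ordinary least squares fit, you can reproduce them directly from the normal equations. The following sketch builds the design matrix from the columns of aa; the column order, with the intercept last, matches the table:

```
/* sketch: replicate the LS estimates via the normal equations */
x = aa[, 2:4] || j(21, 1, 1);     /* VAR1-VAR3 plus intercept column */
y = aa[, 5];                      /* stackloss response              */
bhat = solve(x` * x, x` * y);     /* solve (X'X) b = X'y             */
print bhat[rowname={"VAR1" "VAR2" "VAR3" "Intercep"}];
```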
For LMS, observations 1, 3, 4, and 21 have scaled residuals larger than 2.5 (output not shown), and they are considered outliers. Output 12.2.4 displays the corresponding WLS results.
**RLS Parameter Estimates Based on LMS**

| Variable | Estimate | Approx Std Err | t Value | Pr > \|t\| | Lower WCI | Upper WCI |
|---|---|---|---|---|---|---|
| VAR1 | 0.79768556 | 0.06743906 | 11.83 | <.0001 | 0.66550742 | 0.9298637 |
| VAR2 | 0.57734046 | 0.16596894 | 3.48 | 0.0041 | 0.25204731 | 0.9026336 |
| VAR3 | -0.0670602 | 0.06160314 | -1.09 | 0.2961 | -0.1878001 | 0.05367975 |
| Intercep | -37.652459 | 4.73205086 | -7.96 | <.0001 | -46.927108 | -28.37781 |
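The LMS call also returns the weight information that drives the WLS step. The following sketch lists the flagged observations, assuming (per the layout of the wgt output described in the LMS documentation) that the first row of wgt contains the 0/1 weights, with 0 marking an outlier:

```
/* sketch: list observations that LMS downweights to zero */
outIdx = loc(wgt[1, ] = 0);        /* column indices of zero weights */
if ncol(outIdx) > 0 then
   print "Observations flagged as outliers:" outIdx;
```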
The V7 LTS algorithm is similar to the LMS algorithm. Here is the code:
title2 "***Use 2000 Random Subsets for LTS***"; a = aa[,2:4]; b = aa[,5]; optn = j(9,1,.); optn[2]= 2; /* ipri */ optn[3]= 3; /* ilsq */ optn[8]= 3; /* icov */ optn[9]= 1; /* V7 LTS */ call lts(sc,coef,wgt,optn,b,a);
In addition to observations 1, 3, 4, and 21, which LMS flagged as outliers, LTS finds that observations 2 and 13 also have absolute scaled residuals larger than 2.5, although not by as wide a margin as observations 1, 3, 4, and 21 (output not shown). Therefore, the WLS results based on LTS differ from those based on LMS.
Output 12.2.6 displays the results for the weighted LS regression.
**RLS Parameter Estimates Based on LTS**

| Variable | Estimate | Approx Std Err | t Value | Pr > \|t\| | Lower WCI | Upper WCI |
|---|---|---|---|---|---|---|
| VAR1 | 0.75694055 | 0.07860766 | 9.63 | <.0001 | 0.60287236 | 0.91100874 |
| VAR2 | 0.45353029 | 0.13605033 | 3.33 | 0.0067 | 0.18687654 | 0.72018405 |
| VAR3 | -0.05211 | 0.05463722 | -0.95 | 0.3607 | -0.159197 | 0.054977 |
| Intercep | -34.05751 | 3.82881873 | -8.90 | <.0001 | -41.561857 | -26.553163 |
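You can also reproduce the weighted LS fit in the preceding table by hand from the returned weights. This sketch again assumes that the first row of wgt (here from the preceding LTS call) holds the 0/1 weights:

```
/* sketch: reproduce the RLS fit as weighted least squares */
w  = wgt[1, ]`;                    /* 0/1 weights as a column vector */
x  = aa[, 2:4] || j(21, 1, 1);
y  = aa[, 5];
xw = x # w;                        /* zero out the rows of outliers  */
bwls = solve(xw` * x, xw` * y);    /* solve (X'WX) b = X'Wy          */
print bwls[rowname={"VAR1" "VAR2" "VAR3" "Intercep"}];
```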
The FAST-LTS algorithm uses only 500 random subsets yet achieves better optimization results. Here is the code:
title2 "***Use 500 Random Subsets for FAST-LTS***"; a = aa[,2:4]; b = aa[,5]; optn = j(9,1,.); optn[2]= 2; /* ipri */ optn[3]= 3; /* ilsq */ optn[8]= 3; /* icov */ optn[9]= 0; /* FAST-LTS */ call lts(sc,coef,wgt,optn,b,a);
For this example, the two LTS algorithms identify the same outliers; however, the FAST-LTS algorithm uses only 500 random subsets and attains a smaller value of the objective function, as seen in Output 12.2.7. For large data sets, the advantages of the FAST-LTS algorithm are more pronounced.
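To see the difference in the minimized criterion for yourself, you can run both LTS variants back to back and compare the summary vectors they return; the minimized objective function value is one of the entries of sc (see the LTS documentation for its exact position in the vector):

```
/* sketch: compare the summary vectors of the two LTS variants */
optnV7 = optn;  optnV7[9] = 1;     /* V7 LTS   */
call lts(scV7, coefV7, wgtV7, optnV7, b, a);
optnF  = optn;  optnF[9]  = 0;     /* FAST-LTS */
call lts(scF, coefF, wgtF, optnF, b, a);
print (scV7 || scF)[colname={"V7_LTS" "FAST_LTS"}];
```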
You now report the results of LMS for all different subsets. Here is the code:
title2 "*** Use All 5985 Subsets***"; a = aa[,2:4]; b = aa[,5]; optn = j(9,1,.); optn[2]= 2; /* ipri */ optn[3]= 3; /* ilsq */ optn[5]= -1; /* nrep: all 5985 subsets */ optn[8]= 3; /* icov */ ods select IterHist0 BestSubset EstCoeff; call lms(sc,coef,wgt,optn,b,a);
Output 12.2.8 displays the results for LMS.
Next, report the results of LTS for all different subsets, as follows:
title2 "*** Use All 5985 Subsets***"; a = aa[,2:4]; b = aa[,5]; optn = j(9,1,.); optn[2]= 2; /* ipri */ optn[3]= 3; /* ilsq */ optn[5]= -1; /* nrep: all 5985 subsets */ optn[8]= 3; /* icov */ optn[9]= 1; /* V7 LTS */ ods select IterHist0 BestSubset EstCoeff; call lts(sc,coef,wgt,optn,b,a);
Output 12.2.9 displays the results for LTS.