# The ROBUSTREG Procedure

### Example 98.5 Robust Diagnostics

This example models the selling price of a house as a function of several covariates. One of these covariates is a classification variable that indicates whether a house is located on a corner lot (called a corner house in this example). Because corner houses are relatively rare, the inclusion of this classification effect in the model introduces a low-dimensional structure (that is, the majority of the observations are located in a lower-dimensional hyperplane that is defined as containing non-corner houses) into the design matrix. As discussed in the section Robust Distance, the presence of this low-dimensional structure causes difficulties in the traditional computation of robust distances. This example illustrates how you can use the projected robust distance to address those difficulties and to obtain meaningful leverage diagnostics. It also shows how you can use the RDPLOT= and DDPLOT= options to illustrate the outlier-leverage relationship.

The following house price data set contains 66 home resale records from February 15 to April 30, 1993, on seven variables (Data and Story Library 2005). The records are randomly selected from a database that is maintained by the Albuquerque Board of Realtors.

```
data house;
   input price sqft age feats ne cor tax @@;
   label price = "Selling price"
         sqft  = "Square feet of living space"
         age   = "Age of home in years"
         feats = "Number out of 11 features (dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access)"
         ne    = "Located in northeast sector of city (1) or not (0)"
         cor   = "Corner location (1) or not (0)"
         tax   = "Annual taxes";
   /* sum is an exact linear combination of the six covariates */
   sum = sqft+age+feats+ne+cor+tax;
   /* id records the observation number */
   id  = _N_;
   datalines;
2050 2650 13 7 1 0 1639
2150 2664  6 5 1 0 1193
2150 2921  3 6 1 0 1635
1999 2580  4 4 1 0 1732

... more lines ...

870 1273  4 4 0 0  638
869 1165  7 4 0 0  694
766 1200  7 4 0 1  634
739  970  4 4 0 1  541
;
```

To illustrate the dependence detection ability of the generalized MCD algorithm, an extra variable called `sum` is created such that all the observations satisfy `sum` = `sqft` + `age` + `feats` + `ne` + `cor` + `tax`. Adding the variable `sum` does not change the rank of the original design matrix; `sum` is expected to be ignored in the model and also in the diagnostics. The following statements apply the MM method and the generalized MCD algorithm to the house price data:

```
ods graphics on;

proc robustreg data=house method=MM plots=all;
   model price = sqft age feats ne cor tax sum /
         leverage(opc mcdinfo) diagnostics;
run;
```

As shown in Output 98.5.1 and Output 98.5.2, PROC ROBUSTREG finds the design dependence equation and forces the parameter estimate of variable `sum` to be 0.

Output 98.5.1: MM Estimates

The ROBUSTREG Procedure

Parameter Estimates

| Parameter | DF | Estimate | Standard Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit | Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|---|
| Intercept | 1 | 46.4062 | 79.1714 | -108.767 | 201.5792 | 0.34 | 0.5578 |
| sqft | 1 | 0.3809 | 0.0756 | 0.2327 | 0.5291 | 25.37 | <.0001 |
| age | 1 | -2.6067 | 1.7610 | -6.0582 | 0.8449 | 2.19 | 0.1388 |
| feats | 1 | 8.3627 | 14.7107 | -20.4697 | 37.1951 | 0.32 | 0.5697 |
| ne | 1 | 65.0081 | 40.1329 | -13.6508 | 143.6671 | 2.62 | 0.1053 |
| cor | 1 | -19.2997 | 38.1907 | -94.1520 | 55.5526 | 0.26 | 0.6133 |
| tax | 1 | 0.4699 | 0.1260 | 0.2229 | 0.7170 | 13.90 | 0.0002 |
| sum | 0 | 0.0000 | . | . | . | . | . |
| Scale | 0 | 157.5593 | | | | | |

Output 98.5.2: Design Dependence Equations

 Note: The following variables have been ignored in the MCD computation because of linear dependence.

sum = sqft + age + feats + ne + cor + tax
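As a cross-check outside PROC ROBUSTREG, the same exact dependence can be exposed with an ordinary least squares fit. The following sketch is an illustration, not part of the original example. It assumes the `house` data set created earlier. For a rank-deficient model such as this one, PROC REG is expected to note that the model is not full rank and to report the linear combination involved:

```sas
/* Illustrative cross-check (not part of the original example).      */
/* Fit the same covariates by ordinary least squares. Because sum is */
/* an exact linear combination of the other six covariates, the      */
/* design matrix is rank deficient.                                  */
proc reg data=house;
   model price = sqft age feats ne cor tax sum;
run;
quit;
```

The least squares estimates themselves are not robust and are not expected to match Output 98.5.1; the point of the sketch is only the rank-deficiency report.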

Moreover, PROC ROBUSTREG also identifies a robust dependence equation on `cor` in Output 98.5.3, which holds for 77.27% of the observations but not for the entire data set.

Output 98.5.3: Robust Dependence Equations

 Note: The following robust dependence equations simultaneously hold for 77.27% of the observations in the data set. The breakdown setting for the MCD algorithm is 22.73%.

 cor = 0
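The share of observations that satisfy `cor` = 0 can be checked directly by tabulating the variable. The following sketch is an illustration, not part of the original example, and assumes the `house` data set created earlier. Given the 15 off-plane observations out of 66 that are reported for this example, the frequency table is expected to show 51 non-corner houses (51/66, about 77.27%) and 15 corner houses (15/66, about 22.73%):

```sas
/* Illustrative check (not part of the original example).       */
/* Tabulate corner versus non-corner houses to see how many     */
/* observations satisfy the robust dependence equation cor = 0. */
proc freq data=house;
   tables cor;
run;
```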

Another way to represent the low-dimensional structure is to specify the coefficients of the MCD-dropped components on the data (see Output 98.5.4), which form a basis of the complementary space to the relevant low-dimensional hyperplane.

Output 98.5.4: Coefficients of MCD-Dropped Components

Coefficients for MCD-Dropped Components

| Parameter | DesignDrop0 | RobustDrop1 |
|---|---|---|
| sqft | 0 | 0 |
| age | 0 | 0 |
| feats | 0 | 0 |
| ne | 0 | 0 |
| cor | 0 | 1.0000 |
| tax | 0 | 0 |
| sum | 1.0000 | 0 |

By the definitions of projected robust distance and leverage point, an observation is called an off-plane leverage point if at least one of the robust or design dependence equations does not apply to the observation. In this example, the observations in which `cor` = 1 are all off-plane leverage points. Output 98.5.5 lists the leverage points and outliers along with the relevant distance measurements and standardized residuals.

Output 98.5.5: Diagnostics

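Diagnostics

| Obs | Projected Distance: Mahalanobis | Projected Distance: Robust | Projected Distance: Off-Plane | Leverage | Standardized Robust Residual | Outlier |
|---|---|---|---|---|---|---|
| 1 | 3.5567 | 4.0211 | 0.0000 | * | 0.8522 | |
| 13 | 4.0034 | 5.2310 | 0.0000 | * | 0.1411 | |
| 15 | 1.3221 | 1.5219 | 2.3681 | * | 0.0226 | |
| 16 | 1.0839 | 1.0905 | 2.3681 | * | 0.4148 | |
| 18 | 1.9452 | 2.4655 | 2.3681 | * | -0.2789 | |
| 20 | 3.6006 | 4.0771 | 2.3681 | * | -0.0150 | |
| 22 | 3.0210 | 3.4307 | 2.3681 | * | 1.1664 | |
| 23 | 1.5920 | 1.8197 | 2.3681 | * | 0.2422 | |
| 24 | 3.4967 | 4.5154 | 0.0000 | * | 0.6464 | |
| 26 | 3.0420 | 3.6975 | 0.0000 | * | -1.7068 | |
| 29 | 2.3264 | 2.9925 | 2.3681 | * | -2.4980 | |
| 30 | 1.2587 | 1.2714 | 2.3681 | * | -1.2558 | |
| 38 | 2.4064 | 2.7249 | 2.3681 | * | -1.0620 | |
| 42 | 1.4722 | 1.4645 | 2.3681 | * | 0.2584 | |
| 44 | 2.8491 | 3.0019 | 0.0000 | | 4.5665 | * |
| 46 | 3.9725 | 5.2271 | 0.0000 | * | 3.5835 | * |
| 47 | 2.9431 | 3.3728 | 2.3681 | * | 0.1365 | |
| 55 | 2.2325 | 2.9590 | 2.3681 | * | 0.3217 | |
| 56 | 1.7999 | 1.8119 | 2.3681 | * | 0.1715 | |
| 65 | 1.8831 | 2.1822 | 2.3681 | * | -0.1990 | |
| 66 | 2.2483 | 2.5673 | 2.3681 | * | 0.4134 | |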

Output 98.5.6 and Output 98.5.7 show no apparent difference between corner and non-corner houses, either in the standardized robust residuals or in the relationship between projected MD and projected RD, even though all the corner houses are classified as off-plane leverage points.

Output 98.5.6: Projected RD Plot

Output 98.5.7: Projected DD Plot
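The example requests every available plot with PLOTS=ALL. If only the two diagnostic plots are of interest, a narrower request might look like the following sketch. It assumes that the PLOTS= option accepts the `rdplot` and `ddplot` plot requests in a parenthesized list, and it is not part of the original example:

```sas
/* Illustrative variation (not part of the original example).      */
/* Request only the residual-distance and distance-distance plots  */
/* instead of all plots. The model and LEVERAGE options are the    */
/* same as in the original call.                                   */
ods graphics on;

proc robustreg data=house method=MM plots=(rdplot ddplot);
   model price = sqft age feats ne cor tax sum /
         leverage(opc mcdinfo) diagnostics;
run;
```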

Output 98.5.8 shows more details of the robust diagnostics. The number of dimensions indicates that six regressors are used in the MCD analysis; because `sum` is excluded in model fitting, it is ignored in the MCD analysis. The number of robust dropped components equals 1, which corresponds to the robust dependence equation `cor` = 0 in Output 98.5.3. The number of off-plane observations is 15, which matches the 15 corner-house observations. The reweighted value of H is the number of observations that are finally used to estimate the MCD covariance.

Output 98.5.8: MCD Information

MCD Profile

| Statistic | Value |
|---|---|
| Number of Dimensions | 6 |
| Number of Robust Dropped Components | 1 |
| Number of Observations | 66 |
| Number of Off-Plane Observations | 15 |
| Specified Value of H | 51 |
| Reweighted Value of H | 47 |
| Breakdown Value | 0.2273 |

MCD Center

| Parameter Name | Parameter | Center |
|---|---|---|
| sqft | sqft | 1752.7 |
| age | age | 12.809 |
| feats | feats | 4.0426 |
| ne | ne | 0.6170 |
| cor | cor | -2E-16 |
| tax | tax | 895.40 |
| sum | sum | 2665.6 |

MCD Covariance

| | sqft | age | feats | ne | cor | tax | sum |
|---|---|---|---|---|---|---|---|
| sqft | 248870.3 | -853.232 | 147.0347 | 88.60083 | 0 | 148494.5 | 396747.3 |
| age | -853.232 | 126.2886 | -1.18733 | 1.229417 | 0 | -1251.44 | -1978.34 |
| feats | 147.0347 | -1.18733 | 0.99815 | 0.234043 | 0 | 87.0259 | 361.5814 |
| ne | 88.60083 | 1.229417 | 0.234043 | 0.241443 | 0 | 45.76688 | 134.42 |
| cor | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tax | 148494.5 | -1251.44 | 87.0259 | 45.76688 | 0 | 106652.5 | 255147 |
| sum | 396747.3 | -1978.34 | 361.5814 | 134.42 | 0 | 255147 | 650413.7 |

MCD Correlation

| | sqft | age | feats | ne | cor | tax | sum |
|---|---|---|---|---|---|---|---|
| sqft | 1 | -0.15219 | 0.295009 | 0.361446 | 0 | 0.911462 | 0.986126 |
| age | -0.15219 | 1 | -0.10575 | 0.222643 | 0 | -0.34099 | -0.21829 |
| feats | 0.295009 | -0.10575 | 1 | 0.476749 | 0 | 0.266726 | 0.448759 |
| ne | 0.361446 | 0.222643 | 0.476749 | 1 | 0 | 0.285206 | 0.339204 |
| cor | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tax | 0.911462 | -0.34099 | 0.266726 | 0.285206 | 0 | 1 | 0.968747 |
| sum | 0.986126 | -0.21829 | 0.448759 | 0.339204 | 0 | 0.968747 | 1 |

You might speculate that the projected MD and projected RD are equal to the regular MD and RD computed on the same data set without the variable `cor`. This is not the case. When `cor` is included in the MODEL statement, it is omitted from the distance calculation, but it is still used in the initial orthonormalization step and in the h-subset search. In this example, including `cor` causes all the other covariates to be centered separately for corner houses and non-corner houses. Without `cor`, the centering process does not distinguish corner houses from non-corner houses, and therefore the MCD algorithm can still be influenced by `cor` through the correlation between `cor` and the other covariates. The following statements drop the variable `cor` and produce the RD plot and DD plot for the reduced model, which are shown in Output 98.5.9 and Output 98.5.10, respectively:

```
proc robustreg data=house method=MM plots=all;
   model price = sqft age feats ne tax / leverage(mcdinfo) diagnostics;
run;

ods graphics off;
```

Output 98.5.9: RD Plot for the Reduced Model

Output 98.5.10: DD Plot for the Reduced Model
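To get a rough feel for why separate centering matters, you can compare the covariate means of corner and non-corner houses. The following sketch is an illustration, not part of the original example. It assumes the `house` data set created earlier, and it reports classical group means and standard deviations, not the robust MCD centers that PROC ROBUSTREG computes:

```sas
/* Illustrative check (not part of the original example).          */
/* Compare classical means and standard deviations of the other    */
/* covariates for non-corner (cor=0) and corner (cor=1) houses.    */
/* These are ordinary statistics, not MCD estimates.               */
proc means data=house mean std;
   class cor;
   var sqft age feats ne tax;
run;
```

Noticeable differences between the two groups would suggest that a single common center, as used when `cor` is dropped from the model, represents the two groups differently than group-wise centering does.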

Compared with Output 98.5.8, Output 98.5.11 shows how the MCD information changes when `cor` is removed from the model. The corner houses are no longer identified as off-plane points, and the reweighted value of H increases from 47 to 52. The breakdown value is unchanged because it depends only on the specified value of H and the total number of observations.
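The reported breakdown value is consistent with the ratio of the observations outside the h-subset to the total number of observations. The following arithmetic is a sketch under that assumption, with $n$ denoting the number of observations and $h$ the specified value of H; the symbols are introduced here for illustration:

$$
\text{breakdown value} = \frac{n - h}{n} = \frac{66 - 51}{66} = \frac{15}{66} \approx 0.2273
$$

Because both models use $n = 66$ and $h = 51$, the value is the same in Output 98.5.8 and Output 98.5.11.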

Output 98.5.11: MCD Information for the Reduced Model

MCD Profile

| Statistic | Value |
|---|---|
| Number of Dimensions | 5 |
| Number of Robust Dropped Components | 0 |
| Number of Observations | 66 |
| Number of Off-Plane Observations | 0 |
| Specified Value of H | 51 |
| Reweighted Value of H | 52 |
| Breakdown Value | 0.2273 |

MCD Center

| Parameter Name | Parameter | Center |
|---|---|---|
| sqft | sqft | 1710.9 |
| age | age | 11.173 |
| feats | feats | 3.9423 |
| ne | ne | 0.5962 |
| tax | tax | 858.10 |

MCD Covariance

| | sqft | age | feats | ne | tax |
|---|---|---|---|---|---|
| sqft | 216974.7 | 681.2327 | 199.2492 | 103.0388 | 107503.1 |
| age | 681.2327 | 64.49887 | -0.9506 | 1.855581 | -187.135 |
| feats | 199.2492 | -0.9506 | 0.878959 | 0.152715 | 114.9076 |
| ne | 103.0388 | 1.855581 | 0.152715 | 0.245475 | 49.98077 |
| tax | 107503.1 | -187.135 | 114.9076 | 49.98077 | 66558.68 |

MCD Correlation

| | sqft | age | feats | ne | tax |
|---|---|---|---|---|---|
| sqft | 1 | 0.182102 | 0.456255 | 0.44647 | 0.89457 |
| age | 0.182102 | 1 | -0.12625 | 0.466337 | -0.09032 |
| feats | 0.456255 | -0.12625 | 1 | 0.328771 | 0.475075 |
| ne | 0.44647 | 0.466337 | 0.328771 | 1 | 0.391018 |
| tax | 0.89457 | -0.09032 | 0.475075 | 0.391018 | 1 |