The ROBUSTREG Procedure
Robust Distance
The ROBUSTREG procedure uses the robust multivariate location and scatter estimates for leverage-point detection. The procedure computes a robust version of the Mahalanobis distance by using a generalized minimum covariance determinant (MCD) method. The original MCD method was proposed by Rousseeuw (1984).
PROC ROBUSTREG implements a generalized MCD algorithm based on the fast-MCD algorithm formulated by Rousseeuw and Van Driessen (1999), which is similar to the algorithm for least trimmed squares (LTS).
The canonical Mahalanobis distance is defined as

   MD(x_i) = [ (x_i - x̄)' C^{-1} (x_i - x̄) ]^{1/2}

where x̄ = (1/n) Σ_{i=1}^n x_i and C = (1/(n-1)) Σ_{i=1}^n (x_i - x̄)(x_i - x̄)' are the empirical multivariate location and scatter, respectively. Here x_i excludes the intercept. The relation between the Mahalanobis distance and the hat matrix H = X(X'X)^{-1}X' is

   h_ii = (1/(n-1)) MD(x_i)^2 + 1/n
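The hat-matrix relation, h_ii = MD_i^2/(n-1) + 1/n, can be checked numerically. The following minimal sketch (hypothetical data, one covariate plus an intercept in the design) computes both sides with only the standard library:

```python
import math

# Hypothetical 1-D covariate (p = 1) to check h_ii = MD_i^2/(n-1) + 1/n.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # empirical scatter

# Mahalanobis distance of each point from the empirical location
md = [abs(xi - xbar) / math.sqrt(s2) for xi in x]

# Leverages from the hat matrix of the design [1, x]:
# h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

for mdi, hii in zip(md, h):
    assert abs(hii - (mdi ** 2 / (n - 1) + 1.0 / n)) < 1e-12
```

The identity holds exactly here because, with p = 1, both sides reduce to (x_i - x̄)^2 / Σ_j (x_j - x̄)^2 plus the 1/n intercept term.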
The canonical robust distance is defined as

   RD(x_i) = [ (x_i - T(X))' C(X)^{-1} (x_i - T(X)) ]^{1/2}

where T(X) and C(X) are the robust multivariate location and scatter, respectively, obtained by MCD.
To achieve robustness, the MCD algorithm estimates the covariance of a multivariate data set mainly through an MCD h-point subset of the data set. This subset has the smallest sample-covariance determinant among all possible h-subsets. Accordingly, the breakdown value for the MCD algorithm equals (n-h)/n. This means the MCD estimate is reliable even if up to (n-h)/n × 100% of the observations in the data set are contaminated.
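The subset search can be made concrete in one dimension, where the covariance determinant reduces to the sample variance. The following sketch (hypothetical data; a brute-force search, not the fast-MCD resampling that PROC ROBUSTREG actually uses) finds the minimum-variance h-subset:

```python
import itertools
import statistics

# Hypothetical 1-D data with two gross outliers. In 1-D the minimum
# covariance determinant is just the minimum sample variance.
data = [1.1, 0.9, 1.0, 1.2, 0.8, 9.0, 10.0]
n, h = len(data), 5                     # breakdown value (n - h)/n = 2/7

# Brute force over all C(7, 5) = 21 h-subsets (feasible only for tiny n;
# fast-MCD avoids this enumeration).
best_subset = min(itertools.combinations(data, h), key=statistics.variance)

# The minimizing subset excludes both outliers, so the MCD location and
# scatter estimates are unaffected by them.
assert 9.0 not in best_subset and 10.0 not in best_subset
```

Because the two outliers can only ever inflate a subset's variance, every minimizing subset consists of the five clean points, which is exactly the robustness the breakdown value promises.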
It is possible that the original data are in p-dimensional space, but the h-point subset that yields the minimum covariance determinant lies in a lower-dimensional hyperplane. Applying the canonical MCD algorithm to such a data set results in a singular covariance problem (called exact fit in Rousseeuw and Van Driessen (1999)), so that the relevant robust distances cannot be computed. To deal with the singularity problem and provide further leverage-point analysis, PROC ROBUSTREG implements a generalized MCD algorithm. See the section Generalized MCD Algorithm for details. The algorithm distinguishes in-(hyper)plane points from off-(hyper)plane points, and performs MCD leverage-point analysis in the dimension-reduced space by projecting all points onto the hyperplane.
Low-dimensional structure is often induced by classification covariates. Suppose, in a study with 25 female subjects and 5 male subjects, that gender is the only classification effect. If the breakdown setting is at least 1/6 (so that h ≤ 25), the canonical MCD algorithm fails, and so does the relevant leverage-point analysis. In this case, the MCD h-subset would contain only female observations, and the constant gender in the h-subset would cause the relevant MCD estimate to be singular. The generalized MCD algorithm solves that problem by identifying all male observations as off-plane leverage points, and then carries out the leverage-point analysis with all the other covariates being centered separately for the female and male groups against their group means.
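The singularity mechanism in this example is easy to verify directly: any subset in which a classification covariate is constant has a sample covariance matrix with a zero row and column, hence a zero determinant. A minimal sketch with hypothetical values:

```python
# Hypothetical illustration of the exact-fit problem: an h-subset in which
# gender (coded 0/1) happens to be constant. Its sample covariance matrix
# is then singular, so no robust distance can be computed from it.
subset = [(0, 1.2), (0, 0.7), (0, 1.9), (0, 1.1)]   # (gender, covariate)
n = len(subset)
means = [sum(row[j] for row in subset) / n for j in range(2)]

# 2x2 sample covariance matrix of the subset
cov = [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in subset)
        / (n - 1)
        for k in range(2)] for j in range(2)]

det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
assert det == 0.0   # the constant gender column makes the covariance singular
```

Since a determinant of zero is the smallest possible value, the canonical MCD search is drawn to exactly such subsets, which is why the failure is structural rather than numerical.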
In general, low-dimensional structure is not necessarily due to classification covariates. Imagine that 80 children are supposed to play on a straight trail (the line y = x), but some adventurous children go off the trail. The following statements generate the children data and the relevant scatter plot:
   data children;
      do i=1 to 80;
         off_trail=ranuni(321)>.9;
         x=rannor(111)*ranuni(321);
         trail_x=(i-40)/80*3;
         trail_y=trail_x;
         if off_trail=1 then y=x-1+rannor(321);
         else y=x;
         output;
      end;
   run;
   ods graphics on;
   proc sgplot data=children;
      series x=trail_x y=trail_y / lineattrs=(color="red" pattern=4);
      scatter x=x y=y / group=off_trail;
      ellipse x=x y=y / alpha=.05 lineattrs=(color="green" pattern=34);
   run;
Figure 75.17 shows the positions of all 80 children, the trail (as a red dashed line), and a contour curve of the regular Mahalanobis distance centered at the mean position (as a green dotted ellipse). In terms of the regular Mahalanobis distance, the associated covariance estimate is not singular, but the relevant leverage-point analysis completely ignores the trail (which embodies the low-dimensional structure). The children outside the ellipse are defined as leverage points, but the children off the trail would not be viewed as leverage points unless they have large Mahalanobis distances. As mentioned in Rousseeuw and Van Driessen (1999), the canonical MCD method can find the low-dimensional structure, but it does not provide further robust covariance estimation because the MCD covariance estimate is singular. As an improved version of the canonical MCD method, the generalized MCD method can find the trail, identify the children off the trail as off-plane leverage points, and further execute in-plane leverage analysis. The following statements apply the generalized MCD algorithm to the children data set:
   proc robustreg data=children plots=ddplot(label=none);
      model i = x y / leverage;
   run;
   ods graphics off;
Figure 75.18 exactly identifies the equation underlying the trail. The analysis projects off-plane points onto the trail and computes their projected robust distances and projected Mahalanobis distances the same way as is done for the in-plane points.
Figure 75.19 shows the relevant distance-distance plot. The robust distance is typically larger than the Mahalanobis distance because unusual points can inflate the sample covariance relative to the MCD covariance.
Note: The PROC ROBUSTREG step in this example is used to obtain the leverage diagnostics; the response is not relevant for this analysis.
Through the off-plane and in-plane symbols and the horizontal cutoff line in Figure 75.19, you can separate all the children into four groups:
on-trail and close to the MCD mean position
on-trail but far away from the MCD mean position
off-trail but close to the MCD mean position
off-trail and far away from the MCD mean position
The children in the latter three groups are defined as leverage points in PROC ROBUSTREG.
The generalized MCD algorithm follows the same resampling strategy as the canonical MCD algorithm of Rousseeuw and Van Driessen (1999), but with modifications in the following respects:
Data are orthonormalized before further processing. The orthonormalized covariates, x_i*, are defined by x_i* = Λ^{-1/2} V' x_i, where V and Λ are the eigenvector and eigenvalue matrices of the sample covariance C (that is, C = V Λ V').
Let

   C(H) = V(H) Λ(H) V(H)'

denote the covariance and eigendecomposition for a low-dimensional h-subset H, where Λ(H) = diag(λ_1, ..., λ_p) and the eigenvalues satisfy

   λ_1 ≥ ... ≥ λ_r > λ_{r+1} = ... = λ_p = 0

Then the rank of C(H) equals r, and the pseudo-determinant of C(H) is defined as the product λ_1 λ_2 ⋯ λ_r. In finite precision arithmetic, r is defined as the number of λ_j's with λ_j/λ_1 larger than a certain tolerance value. You can specify this tolerance with the PTOL suboption of the LEVERAGE option.
Given C(H) and T(H) as the covariance and center estimates, the projected Mahalanobis distance for x_i is defined as

   PMD(x_i) = [ Σ_{j=1}^{r} (v_j'(x_i - T(H)))^2 / λ_j ]^{1/2}

The generalized algorithm also computes the off-plane distance for each x_i as

   OPD(x_i) = [ Σ_{j=r+1}^{p} (v_j'(x_i - T(H)))^2 ]^{1/2}

where v_j denotes the jth column of V(H). In finite precision arithmetic, the terms v_j'(x_i - T(H)) in the previous off-plane formula are truncated to zero if they are smaller in magnitude than a certain cutoff value.
You can tune this by using either the PCUTOFF or the PALPHA suboption of the LEVERAGE option. Points with zero off-plane distance are called in-plane points; all others are called off-plane points. Analogous to ordering all points by their canonical Mahalanobis distances, the generalized MCD algorithm first sorts the points by their off-plane distances and then further sorts points that have the same off-plane distance by their projected Mahalanobis distances.
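The two-level ordering just described is a lexicographic sort. A minimal sketch with hypothetical distance values:

```python
# Hypothetical (off-plane distance, projected Mahalanobis distance) pairs.
# In-plane points have off-plane distance exactly zero.
points = [
    {"id": "A", "off_plane": 0.0, "proj_md": 2.1},
    {"id": "B", "off_plane": 1.5, "proj_md": 0.4},
    {"id": "C", "off_plane": 0.0, "proj_md": 0.9},
    {"id": "D", "off_plane": 1.5, "proj_md": 3.0},
]

# Sort by off-plane distance first; break ties by projected Mahalanobis
# distance, mirroring the two-level ordering in the text.
ordered = sorted(points, key=lambda p: (p["off_plane"], p["proj_md"]))
assert [p["id"] for p in ordered] == ["C", "A", "B", "D"]
```

All in-plane points (C, A) come before every off-plane point (B, D) regardless of their projected distances, which is exactly what makes off-plane points the more extreme candidates.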
Instead of comparing the determinants of h-subset covariance matrices, the generalized algorithm compares both the ranks and the pseudo-determinants of the h-subset covariance matrices. If the ranks of two matrices differ, the matrix with the smaller rank is treated as if its determinant were smaller. If two matrices have the same rank, they are compared in terms of their pseudo-determinants.
Suppose that the C(H) of the minimum determinant is singular. Then the relevant low-dimensional structure or hyperplane can be identified by using the eigendecomposition of C(H). The eigenvectors that correspond to the nonzero eigenvalues form a basis for the low-dimensional hyperplane. The projected off-plane distance (POD) for x_i is defined as the off-plane distance associated with this optimal C(H). To provide further leverage analysis on the low-dimensional hyperplane, every x_i is transformed into x_i* = (v_1, ..., v_r)' x_i, where v_1, ..., v_r are the eigenvectors of the optimal C(H) that correspond to its nonzero eigenvalues. The projected robust distance (PRD) is then computed as the reweighted Mahalanobis distance on all the transformed in-plane points. The off-plane points are assigned zero weights at the reweighting stage, because they are leverage points by definition. The in-plane points are classified into two groups, the normal group and the in-plane leverage group. This classification is made by comparing their projected robust distances with a leverage cutoff value. See the section Leverage Point and Outlier Detection for details. This reweighting process mirrors the one proposed by Rousseeuw and Van Driessen (1999), except that the degrees of freedom for the reweighting critical value are replaced by r. You can control the critical value with the MCDCUTOFF or the MCDALPHA option.
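The weighting logic of the reweighting stage can be sketched in a few lines. All names, values, and the cutoff below are hypothetical stand-ins; the real cutoff is chi-square based, as described above:

```python
import math

# Hypothetical points: (projected robust distance, off-plane distance,
# coordinate on the hyperplane). Off-plane points get weight zero; in-plane
# points are kept only while their PRD is below the leverage cutoff.
points = [
    (0.5, 0.0, 1.0),
    (0.8, 0.0, 1.2),
    (4.0, 0.0, 6.0),   # in-plane leverage point: large PRD
    (0.3, 2.5, 1.1),   # off-plane leverage point: nonzero off-plane distance
]
cutoff = 3.0           # stand-in for the chi-square based leverage cutoff

weights = [1 if od == 0.0 and prd < cutoff else 0 for prd, od, _ in points]

# Reweighted location estimate uses only the surviving (normal) points.
reweighted_mean = (sum(w * v for w, (_, _, v) in zip(weights, points))
                   / sum(weights))
assert weights == [1, 1, 0, 0]
assert math.isclose(reweighted_mean, 1.1)
```

Both kinds of leverage points are excluded, so the reweighted estimate is driven entirely by the normal group, which is the point of the final refinement step.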
If the data set under investigation has a low-dimensional structure, you can use two ODS objects, “DependenceEquations” and “MCDDependenceEquations,” to identify the regressors that are linear combinations of other regressors plus certain constants. The equations in “DependenceEquations” hold for the entire data set, while the equations in “MCDDependenceEquations” apply only to the majority of the observations.
By using the OPC suboption of the LEVERAGE option, you can request an ODS table called “DroppedComponents”. This table contains a set of coefficient vectors for regressors, which form a basis of the complementary space for the relevant low-dimensional structure.
Copyright © SAS Institute, Inc. All Rights Reserved.