MCD Call
The MCD subroutine computes the minimum covariance determinant (MCD) estimator. The MCD estimator is a robust estimator of multivariate location and scatter, defined by minimizing the determinant of the covariance matrix computed from a subset of h points (h is the quantile controlled by opt[4]). The algorithm for the MCD subroutine is based on the FAST-MCD algorithm given by Rousseeuw and Van Driessen (1999).
These robust locations and covariance matrices can be used to detect multivariate outliers and leverage points. For this purpose, the MCD subroutine provides a table of robust distances.
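To make the definition concrete, the following Python sketch (an illustration outside SAS/IML, not the FAST-MCD algorithm and not how the MCD subroutine is implemented) searches every subset of h points in a small two-dimensional data set and keeps the one whose sample covariance matrix has the smallest determinant. The data, the function names, and the choice h = 4 are assumptions made for the example.

```python
from itertools import combinations

def cov_det_2d(points):
    """Determinant of the 2x2 sample covariance matrix of 2-D points."""
    m = len(points)
    mx = sum(p[0] for p in points) / m
    my = sum(p[1] for p in points) / m
    sxx = sum((p[0] - mx) ** 2 for p in points) / (m - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (m - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (m - 1)
    return sxx * syy - sxy ** 2

def brute_force_mcd(points, h):
    """Return (determinant, indices) of the h-subset with the smallest determinant."""
    best = None
    for idx in combinations(range(len(points)), h):
        d = cov_det_2d([points[i] for i in idx])
        if best is None or d < best[0]:
            best = (d, idx)
    return best

# Made-up data: five clustered points and one gross outlier (index 5).
data = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.3), (1.0, 0.8), (8.0, 9.0)]
h = 4  # assumed subset size for this toy example

det, subset = brute_force_mcd(data, h)
print(subset, det)  # the selected subset does not contain the outlier
```

Exhaustive search of this kind is feasible only for very small data sets, which is why practical algorithms such as FAST-MCD work with a limited number of subsets.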
In the following discussion, N is the number of observations and n is the number of regressors. The input arguments to the MCD subroutine are as follows:
opt refers to an options vector with the following components (missing values are treated as default values):
opt[1] specifies the amount of printed output. Higher option values request additional output and include the output of lower values.
opt[1]=0 prints no output except error messages.
opt[1]=1 prints most of the output.
opt[1]=2 additionally prints case numbers of the observations in the best subset and some basic history of the optimization process.
opt[1]=3 additionally prints how many subsets result in singular linear systems.
The default is opt[1]=0.
opt[2] specifies whether the classical, initial, and final robust covariance matrices are printed. The default is opt[2]=0. Note that the final robust covariance matrix is always returned in coef.
opt[3] specifies whether the classical, initial, and final robust correlation matrices are printed or returned:
opt[3]=0 does not return or print.
opt[3]=1 prints the robust correlation matrix.
opt[3]=2 returns the final robust correlation matrix in coef.
opt[3]=3 prints and returns the final robust correlation matrix.
opt[4] specifies the quantile h used in the objective function. The default is opt[4] = h = [(N + n + 1)/2], where the brackets denote the integer part. If h is specified outside its permitted range, it is reset to the closest boundary of that range.
opt[5] specifies the number of subset generations. This option is the same as described for the LMS and LTS subroutines. Because of computer time restrictions, not all subset combinations can be inspected for larger values of N and n.
When opt[5] is zero or missing:
If N > 600, construct up to five disjoint random subsets with sizes as equal as possible, but not to exceed 300. Inside each subset, choose 500/5 = 100 subset combinations of n + 1 observations.
If N ≤ 600, the number of subsets is taken from the following table.
n | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
N_lower | 500 | 50 | 22 | 17 | 15 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
If the number of cases (observations) N is smaller than N_lower, then all possible subsets are used; otherwise, 500 subsets are chosen randomly. This means that an exhaustive search is performed for opt[5]=-1. If N is larger than a corresponding upper bound N_upper, a note is printed in the log file that indicates how many subsets exist.
x refers to an N × n matrix of regressors.
Missing values are not permitted in x. Missing values in opt cause the default value to be used.
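The size of the search space can be checked with a short Python sketch (an illustration outside SAS/IML). It assumes the FAST-MCD convention that the default quantile is the integer part of (N + n + 1)/2 and that each initial subset holds n + 1 observations; both function names are hypothetical.

```python
from math import comb

def default_quantile(N, n):
    """Assumed default quantile h = floor((N + n + 1) / 2)."""
    return (N + n + 1) // 2

def total_subsets(N, n):
    """Number of distinct subsets of n + 1 observations out of N."""
    return comb(N, n + 1)

# Stackloss example from this section: N = 21 observations, n = 3 regressors.
print(default_quantile(21, 3))  # 12
print(total_subsets(21, 3))     # 5985
```

The value 12 agrees with the twelve observations reported in the best subset of the stackloss example, and 5,985 is the subset count quoted there.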
The MCD subroutine returns the following values:
sc is a column vector that contains the following scalar information:
sc[1] the quantile h used in the objective function
sc[2] number of subsets generated
sc[3] number of subsets with singular linear systems
sc[4] number of nonzero weights
sc[5] lowest value of the objective function attained (smallest determinant)
sc[6] Mahalanobis-like distance used in the computation of the lowest value of the objective function
sc[7] the cutoff value used for the outlier decision
coef is a matrix with n columns that contains the following results in its rows:
coef[1,] location of ellipsoid center
coef[2,] eigenvalues of final robust scatter matrix
coef[3:2+n,] the final robust scatter matrix for opt[2]=1 or opt[2]=3
coef[2+n+1:2+2n,] the final robust correlation matrix for opt[3]=1 or opt[3]=3
dist is a matrix with N columns that contains the following results in its rows:
dist[1,] Mahalanobis distances
dist[2,] robust distances based on the final estimates
dist[3,] weights (=1 for small, =0 for large robust distances)
Consider the Brownlee (1965) stackloss data, which are also used in the example for the MVE subroutine.
For N = 21 and n = 3 (three explanatory variables; the intercept column is not passed to the subroutine), each subset contains n + 1 = 4 observations, so there is a total of 5,985 different subsets of 4 observations out of 21. If you do not specify optn[5], the MCD algorithm chooses random sample subsets, as in the following statements:
    /*    X1  X2  X3   Y    Stackloss data */
    aa = { 1  80  27  89  42,
           1  80  27  88  37,
           1  75  25  90  37,
           1  62  24  87  28,
           1  62  22  87  18,
           1  62  23  87  18,
           1  62  24  93  19,
           1  62  24  93  20,
           1  58  23  87  15,
           1  58  18  80  14,
           1  58  18  89  14,
           1  58  17  88  13,
           1  58  18  82  11,
           1  58  19  93  12,
           1  50  18  89   8,
           1  50  18  86   7,
           1  50  19  72   8,
           1  50  19  79   8,
           1  50  20  80   9,
           1  56  20  82  15,
           1  70  20  91  15 };

    a = aa[,2:4];
    optn = j(8, 1, .);
    optn[1] = 2;   /* ipri */
    optn[2] = 1;   /* pcov: print COV */
    optn[3] = 1;   /* pcor: print CORR */

    call mcd(sc, xmcd, dist, optn, a);
A portion of the output is shown in the following figures. Figure 23.171 shows a summary of the MCD algorithm and the final points selected.
Figure 23.172 shows the observations that were chosen to form the robust estimates.
4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 20 |
Figure 23.173 shows the MCD estimators of the location, scatter matrix, and correlation matrix. The MCD scatter matrix is multiplied by a factor to make it consistent when the data come from a single Gaussian distribution.
Figure 23.174 shows the classical Mahalanobis distances, the robust distances, and the weights that identify the outlying observations (that is, leverage points when explaining y with these three regressor variables).
Classical Distances and Robust (Rousseeuw) Distances: Unsquared Mahalanobis Distance and Unsquared Rousseeuw Distance of Each Observation

N | Mahalanobis Distances | Robust Distances | Weight |
---|---|---|---|
1 | 2.253603 | 12.173282 | 0 |
2 | 2.324745 | 12.255677 | 0 |
3 | 1.593712 | 9.263990 | 0 |
4 | 1.271898 | 1.401368 | 1.000000 |
5 | 0.303357 | 1.420020 | 1.000000 |
6 | 0.772895 | 1.291188 | 1.000000 |
7 | 1.852661 | 1.460370 | 1.000000 |
8 | 1.852661 | 1.460370 | 1.000000 |
9 | 1.360622 | 2.120590 | 1.000000 |
10 | 1.745997 | 1.809708 | 1.000000 |
11 | 1.465702 | 1.362278 | 1.000000 |
12 | 1.841504 | 1.667437 | 1.000000 |
13 | 1.482649 | 1.416724 | 1.000000 |
14 | 1.778785 | 1.988240 | 1.000000 |
15 | 1.690241 | 5.874858 | 0 |
16 | 1.291934 | 5.606157 | 0 |
17 | 2.700016 | 6.133319 | 0 |
18 | 1.503155 | 5.760432 | 0 |
19 | 1.593221 | 6.156248 | 0 |
20 | 0.807054 | 2.172300 | 1.000000 |
21 | 2.176761 | 7.622769 | 0 |
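
The weights in the table can be reproduced from the robust distances with a few lines of Python (an illustration outside SAS/IML). The sketch assumes the conventional cutoff, the square root of the 0.975 quantile of a chi-square distribution with 3 degrees of freedom (about 3.0575); the cutoff that the subroutine actually used is the one it returns among its scalar results. The distances are copied from the table.

```python
# Distances copied from the table above (observations 1 through 21).
classical = [2.253603, 2.324745, 1.593712, 1.271898, 0.303357, 0.772895,
             1.852661, 1.852661, 1.360622, 1.745997, 1.465702, 1.841504,
             1.482649, 1.778785, 1.690241, 1.291934, 2.700016, 1.503155,
             1.593221, 0.807054, 2.176761]
robust = [12.173282, 12.255677, 9.263990, 1.401368, 1.420020, 1.291188,
          1.460370, 1.460370, 2.120590, 1.809708, 1.362278, 1.667437,
          1.416724, 1.988240, 5.874858, 5.606157, 6.133319, 5.760432,
          6.156248, 2.172300, 7.622769]

# Assumed cutoff: sqrt of the 0.975 chi-square quantile with 3 degrees of freedom.
CUTOFF = 9.3484 ** 0.5

def flagged(distances, cutoff=CUTOFF):
    """Return 1-based observation numbers whose distance exceeds the cutoff."""
    return [i + 1 for i, d in enumerate(distances) if d > cutoff]

weights = [1 if d <= CUTOFF else 0 for d in robust]

print(flagged(robust))     # [1, 2, 3, 15, 16, 17, 18, 19, 21]
print(flagged(classical))  # []
print(sum(weights))        # 12
```

None of the classical Mahalanobis distances exceeds this cutoff, whereas the robust distances flag nine observations, which illustrates how outliers can mask one another in the classical estimates.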
Copyright © SAS Institute, Inc. All Rights Reserved.