The CORR Procedure

Hoeffding Dependence Coefficient


Hoeffding’s measure of dependence, D, is a nonparametric measure of association that detects more general departures from independence than linear or monotone association. The statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables (Hoeffding 1948). Each $(x,y)$ pair serves as a cut point for the classification. The formula for Hoeffding’s D is

\[ D = 30 \, \frac{(n-2)(n-3)D_1+D_2-2(n-2)D_3}{n(n-1)(n-2)(n-3)(n-4)} \]

where $D_1 = \sum_i (Q_i-1)(Q_i-2)$, $D_2 = \sum_i (R_i-1)(R_i-2)(S_i-1)(S_i-2)$, and $D_3 = \sum_i (R_i-2)(S_i-2)(Q_i-1)$. $R_i$ is the rank of $x_i$, $S_i$ is the rank of $y_i$, and $Q_i$ (also called the bivariate rank) is 1 plus the number of points with both x and y values less than those of the $i$th point.

A point that is tied on only the x value or only the y value contributes 1/2 to $Q_i$ if its other value is less than the corresponding value for the $i$th point. A point that is tied on both x and y contributes 1/4 to $Q_i$.

PROC CORR obtains the $Q_i$ values by first ranking the data. The data are then double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. Hoeffding’s D statistic is computed using the number of interchanges of the first variable.

When no ties occur among data set observations, the D statistic values are between –0.5 and 1, with 1 indicating complete dependence. When ties occur, however, the D statistic can take a smaller value. That is, for a pair of variables with identical values, Hoeffding’s D statistic might be less than 1. With a large number of ties in a small data set, the D statistic might be less than –0.5. For more information about Hoeffding’s D, see Hollander and Wolfe (1999).
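The definitions of $R_i$, $S_i$, $Q_i$, and D can be traced with a small brute-force sketch in Python. It is an illustration of the formula only, not PROC CORR's implementation: it compares every pair of points instead of using the double sort, the function name `hoeffding_d` is hypothetical, and it assumes that tied values receive average ranks (midranks) for $R_i$ and $S_i$, which the text above does not state explicitly.

```python
def hoeffding_d(x, y):
    """Brute-force Hoeffding's D from the rank-based formula (O(n^2) sketch)."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n < 5:
        raise ValueError("the formula needs at least 5 observations")

    def midranks(v):
        # Assumption: ties get average ranks, i.e.
        # rank = 1 + (number of smaller values) + 0.5 * (number of other tied values).
        return [
            1.0
            + sum(1 for b in v if b < a)
            + 0.5 * (sum(1 for b in v if b == a) - 1)
            for a in v
        ]

    R = midranks(x)
    S = midranks(y)

    # Bivariate rank Q_i: 1 plus contributions from the other points.
    Q = []
    for i in range(n):
        q = 1.0
        for j in range(n):
            if j == i:
                continue
            if x[j] < x[i] and y[j] < y[i]:
                q += 1.0    # strictly less on both x and y
            elif x[j] == x[i] and y[j] < y[i]:
                q += 0.5    # tied on x only, y is less
            elif x[j] < x[i] and y[j] == y[i]:
                q += 0.5    # tied on y only, x is less
            elif x[j] == x[i] and y[j] == y[i]:
                q += 0.25   # tied on both x and y
        Q.append(q)

    D1 = sum((q - 1) * (q - 2) for q in Q)
    D2 = sum((r - 1) * (r - 2) * (s - 1) * (s - 2) for r, s in zip(R, S))
    D3 = sum((r - 2) * (s - 2) * (q - 1) for r, s, q in zip(R, S, Q))

    numerator = (n - 2) * (n - 3) * D1 + D2 - 2 * (n - 2) * D3
    denominator = n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    return 30 * numerator / denominator
```

Under these assumptions, five untied points on a strictly increasing or strictly decreasing line give D = 1, and five observations that are all identical give a value below –0.5, which matches the behavior described above for heavily tied small data sets.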

Probability Values

The probability values for Hoeffding’s D statistic are computed from the asymptotic distribution derived by Blum, Kiefer, and Rosenblatt (1961). The quantity that is referred to this distribution is

\[ \frac{(n-1)\pi ^{4}}{60}D + \frac{\pi ^4}{72} \]

If the sample size is less than 10, refer to the tables for the distribution of D in Hollander and Wolfe (1999).
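As a small arithmetic illustration, the scaled quantity above can be evaluated from a value of D and the sample size n. The function name `scaled_hoeffding_statistic` is hypothetical. The sketch stops at the scaled value: converting it to a probability requires the Blum, Kiefer, and Rosenblatt distribution, which is not reproduced here.

```python
import math


def scaled_hoeffding_statistic(d, n):
    """Return (n - 1) * pi^4 / 60 * D + pi^4 / 72 for a given D and sample size n."""
    if n < 5:
        raise ValueError("Hoeffding's D is defined only for n >= 5")
    return (n - 1) * math.pi ** 4 / 60 * d + math.pi ** 4 / 72
```

When D is 0, the scaled value reduces to the constant term $\pi^4/72$, whatever the sample size.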