The FREQ Procedure

Measures of Association

When you specify the MEASURES option in the TABLES statement, PROC FREQ computes several statistics that describe the association between the row and column variables of the contingency table. The following are measures of ordinal association that consider whether the column variable Y tends to increase as the row variable X increases: gamma, Kendall’s tau-b, Stuart’s tau-c, and Somers’ D. These measures are appropriate for ordinal variables, and they classify pairs of observations as concordant or discordant. A pair is concordant if the observation with the larger value of X also has the larger value of Y. A pair is discordant if the observation with the larger value of X has the smaller value of Y. See Agresti (2007) and the other references cited for the individual measures of association.

The Pearson correlation coefficient and the Spearman rank correlation coefficient are also appropriate for ordinal variables. The Pearson correlation describes the strength of the linear association between the row and column variables, and it is computed by using the row and column scores specified by the SCORES= option in the TABLES statement. The Spearman correlation is computed with rank scores. The polychoric correlation (requested by the PLCORR option) also requires ordinal variables and assumes that the variables have an underlying bivariate normal distribution. The following measures of association do not require ordinal variables and are appropriate for nominal variables: lambda asymmetric, lambda symmetric, and the uncertainty coefficients.

PROC FREQ computes estimates of the measures according to the formulas given in the following sections. For each measure, PROC FREQ computes an asymptotic standard error (ASE), which is the square root of the asymptotic variance denoted by Var in the following sections.

Confidence Limits

If you specify the CL option in the TABLES statement, PROC FREQ computes asymptotic confidence limits for all MEASURES statistics. The confidence coefficient is determined according to the value of the ALPHA= option, which, by default, equals 0.05 and produces 95% confidence limits.

The confidence limits are computed as

$\mr{Est} ~ \pm ~ (~ z_{\alpha /2} \times \mr{ASE} ~ )$

where Est is the estimate of the measure, $z_{\alpha /2}$ is the $100(1-\alpha /2)$th percentile of the standard normal distribution, and ASE is the asymptotic standard error of the estimate.
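As a minimal sketch of this computation (an illustration in Python, not PROC FREQ code), the following function forms the limits from an estimate and its ASE. The function name and the example values are assumptions made for illustration; `statistics.NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist

def asymptotic_confidence_limits(est, ase, alpha=0.05):
    """Return (lower, upper) limits Est -/+ z_{alpha/2} * ASE."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 100(1 - alpha/2)th percentile
    return est - z * ase, est + z * ase

# Hypothetical estimate and ASE, for illustration only.
lower, upper = asymptotic_confidence_limits(0.40, 0.10)
```

With the default alpha of 0.05, the multiplier is approximately 1.96, so the limits in this example are roughly 0.40 plus or minus 0.196.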

Asymptotic Tests

For each measure that you specify in the TEST statement, PROC FREQ computes an asymptotic test of the null hypothesis that the measure equals zero. Asymptotic tests are available for the following measures of association: gamma, Kendall’s tau-b, Stuart’s tau-c, Somers’ $D(C|R)$ , Somers’ $D(R|C)$ , the Pearson correlation coefficient, and the Spearman rank correlation coefficient. To compute an asymptotic test, PROC FREQ uses a standardized test statistic z, which has an asymptotic standard normal distribution under the null hypothesis. The test statistic is computed as

$z = \mr{Est} ~ / ~ \sqrt {\mr{Var}_0(\mr{Est})}$

where Est is the estimate of the measure and $\mr{Var}_0(\mr{Est})$ is the variance of the estimate under the null hypothesis. Formulas for $\mr{Var}_0(\mr{Est})$ for the individual measures of association are given in the following sections.

Note that the ratio of Est to $\sqrt {\mr{Var}_0(\mr{Est})}$ is the same for the following measures: gamma, Kendall’s tau-b, Stuart’s tau-c, Somers’ $D(C|R)$ , and Somers’ $D(R|C)$ . Therefore, the tests for these measures are identical. For example, the p-values for the test of $H_0\colon \mr{gamma} = 0$ equal the p-values for the test of $H_0\colon \mr{tau\mbox{-}b} = 0$ .

PROC FREQ computes one-sided and two-sided p-values for each of these tests. When the test statistic z is greater than its null hypothesis expected value of zero, PROC FREQ displays the right-sided p-value, which is the probability of a larger value of the statistic occurring under the null hypothesis. A small right-sided p-value supports the alternative hypothesis that the true value of the measure is greater than zero. When the test statistic is less than or equal to zero, PROC FREQ displays the left-sided p-value, which is the probability of a smaller value of the statistic occurring under the null hypothesis. A small left-sided p-value supports the alternative hypothesis that the true value of the measure is less than zero. The one-sided p-value $P_1$ can be expressed as

$\begin{equation*} P_1 = \begin{cases} \mr{Prob} (Z > z) \quad \mr{if} \hspace{.1in} z > 0 \\ \mr{Prob} (Z < z) \quad \mr{if} \hspace{.1in} z \leq 0 \\ \end{cases}\end{equation*}$

where Z has a standard normal distribution. The two-sided p-value $P_{2}$ is computed as

$P_{2} = \mr{Prob} (|Z| > |z|)$
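The test statistic and both p-values can be sketched in a few lines of Python. This is an illustration of the formulas in this section under the stated definitions, not the procedure's implementation; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def asymptotic_p_values(est, var0):
    """Return (z, P1, P2) for the test that a measure equals zero."""
    z = est / sqrt(var0)                     # standardized test statistic
    cdf = NormalDist().cdf
    # Right-sided p-value when z > 0, left-sided p-value otherwise.
    p1 = 1 - cdf(z) if z > 0 else cdf(z)
    p2 = 2 * (1 - cdf(abs(z)))               # Prob(|Z| > |z|)
    return z, p1, p2
```

Because the standard normal distribution is symmetric, the two-sided p-value is twice the displayed one-sided p-value.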

Exact Tests

Exact tests are available for the following measures of association: Kendall’s tau-b, Stuart’s tau-c, Somers’ $D(C|R)$ and $D(R|C)$ , the Pearson correlation coefficient, and the Spearman rank correlation coefficient. If you request an exact test for a measure of association in the EXACT statement, PROC FREQ computes the exact test of the hypothesis that the measure equals zero. See the section Exact Statistics for details.

Gamma

The gamma ( $\Gamma$ ) statistic is based only on the number of concordant and discordant pairs of observations. It ignores tied pairs (that is, pairs of observations that have equal values of X or equal values of Y). Gamma is appropriate only when both variables lie on an ordinal scale. The range of gamma is $-1 \leq \Gamma \leq 1$ . If the row and column variables are independent, then gamma tends to be close to zero. Gamma is computed as

$G = (P - Q) ~ / ~ (P + Q)$

and the asymptotic variance is

$\mr{Var}(G) = \frac{16}{(P + Q)^4} \sum _ i \sum _ j n_{ij} (QA_{ij} - PD_{ij})^2$

For $2 \times 2$ tables, gamma is equivalent to Yule’s Q. See Goodman and Kruskal (1979) and Agresti (2002) for more information.

The variance under the null hypothesis that gamma equals zero is computed as

$\mr{Var}_0(G) = \frac{4}{(P+Q)^2} \left( \sum _ i \sum _ j n_{ij} (A_{ij}-D_{ij})^2 - (P-Q)^2/n \right)$

See Brown and Benedetti (1977) for details.
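The gamma formulas can be checked numerically with a short Python sketch. It assumes the usual notation, which is not restated in this section: $A_{ij}$ is the total frequency of cells that are concordant with cell (i, j), $D_{ij}$ is the total frequency of discordant cells, $P = \sum _ i \sum _ j n_{ij} A_{ij}$ , and $Q = \sum _ i \sum _ j n_{ij} D_{ij}$ . The function names and the example table are hypothetical.

```python
from math import sqrt

def concordant_discordant(n, i, j):
    """Return (A_ij, D_ij) for cell (i, j) of table n (a list of rows)."""
    a = d = 0
    for k, row in enumerate(n):
        for l, count in enumerate(row):
            direction = (k - i) * (l - j)
            if direction > 0:
                a += count      # both indices larger, or both smaller
            elif direction < 0:
                d += count      # one index larger, the other smaller
    return a, d

def gamma_stats(n):
    """Return (G, Var(G), Var0(G)) from the formulas in this section."""
    total = sum(map(sum, n))
    cells = []
    for i, row in enumerate(n):
        for j, nij in enumerate(row):
            a, d = concordant_discordant(n, i, j)
            cells.append((nij, a, d))
    P = sum(nij * a for nij, a, d in cells)
    Q = sum(nij * d for nij, a, d in cells)
    G = (P - Q) / (P + Q)
    var = 16 / (P + Q) ** 4 * sum(nij * (Q * a - P * d) ** 2 for nij, a, d in cells)
    var0 = 4 / (P + Q) ** 2 * (
        sum(nij * (a - d) ** 2 for nij, a, d in cells) - (P - Q) ** 2 / total
    )
    return G, var, var0

table = [[10, 5], [3, 12]]      # hypothetical 2 x 2 counts
G, var, var0 = gamma_stats(table)
```

For this table, P = 240 and Q = 30, so G = 210/270, which agrees with Yule's Q computed as (10·12 − 5·3) / (10·12 + 5·3) = 105/135. The sketch does not guard against P + Q = 0.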

Kendall’s Tau-b

Kendall’s tau-b ( $\tau _ b$ ) is similar to gamma except that tau-b uses a correction for ties. Tau-b is appropriate only when both variables lie on an ordinal scale. The range of tau-b is $-1 \leq \tau _ b \leq 1$ . Kendall’s tau-b is computed as

$t_ b = (P - Q) ~ / ~ \sqrt {w_ r w_ c}$

and the asymptotic variance is

$\mr{Var}(t_ b) = \frac{1}{w^4} \left( \sum _ i \sum _ j n_{ij} (2wd_{ij} + t_ b v_{ij})^2 - n^3 t_ b^2 (w_ r + w_ c)^2 \right)$

where

$\begin{eqnarray*} w & = & \sqrt {w_ r w_ c} \\[0.05in] w_ r & = & n^2 - \sum _ i n_{i \cdot }^2 \\[0.05in] w_ c & = & n^2 - \sum _ j n_{\cdot j}^2 \\[0.05in] d_{ij} & = & A_{ij} - D_{ij} \\[0.05in] v_{ij} & = & n_{i \cdot } w_ c + n_{\cdot j} w_ r \end{eqnarray*}$

See Kendall (1955) for more information.

The variance under the null hypothesis that tau-b equals zero is computed as

$\mr{Var}_0(t_ b) = \frac{4}{w_ r w_ c} \left( \sum _ i \sum _ j n_{ij} (A_{ij} - D_{ij})^2 - (P-Q)^2/n \right)$

See Brown and Benedetti (1977) for details.

PROC FREQ also provides an exact test for Kendall’s tau-b. You can request this test by specifying the KENTB option in the EXACT statement. See the section Exact Statistics for more information.
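A minimal Python sketch of the tau-b formulas follows. It assumes that $A_{ij}$ and $D_{ij}$ are the total frequencies of cells concordant and discordant with cell (i, j), so that $P - Q = \sum _ i \sum _ j n_{ij} d_{ij}$ ; that notation is not restated in this section. The function names and the example table are hypothetical.

```python
from math import sqrt

def concordant_discordant(n, i, j):
    """Return (A_ij, D_ij) for cell (i, j) of table n (a list of rows)."""
    a = d = 0
    for k, row in enumerate(n):
        for l, count in enumerate(row):
            direction = (k - i) * (l - j)
            if direction > 0:
                a += count
            elif direction < 0:
                d += count
    return a, d

def kendall_tau_b(n):
    """Return (t_b, Var(t_b), Var0(t_b)) from the formulas in this section."""
    total = sum(map(sum, n))
    row_tot = [sum(row) for row in n]
    col_tot = [sum(col) for col in zip(*n)]
    w_r = total ** 2 - sum(t ** 2 for t in row_tot)
    w_c = total ** 2 - sum(t ** 2 for t in col_tot)
    w = sqrt(w_r * w_c)
    cells = []                  # (n_ij, d_ij, v_ij) for every cell
    for i, row in enumerate(n):
        for j, nij in enumerate(row):
            a, d = concordant_discordant(n, i, j)
            cells.append((nij, a - d, row_tot[i] * w_c + col_tot[j] * w_r))
    p_minus_q = sum(nij * dij for nij, dij, _ in cells)
    t_b = p_minus_q / w
    var = (
        sum(nij * (2 * w * dij + t_b * vij) ** 2 for nij, dij, vij in cells)
        - total ** 3 * t_b ** 2 * (w_r + w_c) ** 2
    ) / w ** 4
    var0 = 4 / (w_r * w_c) * (
        sum(nij * dij ** 2 for nij, dij, _ in cells) - p_minus_q ** 2 / total
    )
    return t_b, var, var0

t_b, var, var0 = kendall_tau_b([[10, 5], [3, 12]])   # hypothetical counts
```

For a $2 \times 2$ table, tau-b reduces to the phi coefficient, which gives a convenient check: here both equal 105 / sqrt(15·15·13·17), approximately 0.471.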

Stuart’s Tau-c

Stuart’s tau-c ( $\tau _ c$ ) makes an adjustment for table size in addition to a correction for ties. Tau-c is appropriate only when both variables lie on an ordinal scale. The range of tau-c is $-1 \leq \tau _ c \leq 1$ . Stuart’s tau-c is computed as

$t_ c = m(P - Q) ~ / ~ \left( n^2(m-1) \right)$

and the asymptotic variance is

$\mr{Var}(t_ c) = \frac{4m^2}{(m - 1)^2 n^4} \left( \sum _ i \sum _ j n_{ij} d_{ij}^2 - (P-Q)^2/n \right)$

where $m = \min (R,C)$ and $d_{ij} = A_{ij} - D_{ij}$ . The variance under the null hypothesis that tau-c equals zero is the same as the asymptotic variance Var,

$\mr{Var}_0(t_ c) = \mr{Var}(t_ c)$

See Brown and Benedetti (1977) for details.

PROC FREQ also provides an exact test for Stuart’s tau-c. You can request this test by specifying the STUTC option in the EXACT statement. See the section Exact Statistics for more information.
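The tau-c formulas can be sketched in Python as follows. The sketch assumes that $A_{ij}$ and $D_{ij}$ are the total frequencies of cells concordant and discordant with cell (i, j), which is not restated in this section, and that the table has at least two rows and two columns (otherwise $m - 1 = 0$ ). The function names and the example table are hypothetical.

```python
def concordant_discordant(n, i, j):
    """Return (A_ij, D_ij) for cell (i, j) of table n (a list of rows)."""
    a = d = 0
    for k, row in enumerate(n):
        for l, count in enumerate(row):
            direction = (k - i) * (l - j)
            if direction > 0:
                a += count
            elif direction < 0:
                d += count
    return a, d

def stuart_tau_c(n):
    """Return (t_c, Var(t_c)); Var0(t_c) equals Var(t_c)."""
    total = sum(map(sum, n))
    m = min(len(n), len(n[0]))           # m = min(R, C)
    cells = []                           # (n_ij, d_ij) for every cell
    for i, row in enumerate(n):
        for j, nij in enumerate(row):
            a, d = concordant_discordant(n, i, j)
            cells.append((nij, a - d))
    p_minus_q = sum(nij * dij for nij, dij in cells)
    t_c = m * p_minus_q / (total ** 2 * (m - 1))
    var = 4 * m ** 2 / ((m - 1) ** 2 * total ** 4) * (
        sum(nij * dij ** 2 for nij, dij in cells) - p_minus_q ** 2 / total
    )
    return t_c, var

t_c, var = stuart_tau_c([[10, 5], [3, 12]])   # hypothetical 2 x 2 counts
```

For this table, P − Q = 210, n = 30, and m = 2, so t_c = 2·210 / 900, approximately 0.467.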

Somers’ D

Somers’ $D(C|R)$ and Somers’ $D(R|C)$ are asymmetric modifications of tau-b. $C|R$ indicates that the row variable X is regarded as the independent variable and the column variable Y is regarded as dependent. Similarly, $R|C$ indicates that the column variable Y is regarded as the independent variable and the row variable X is regarded as dependent. Somers’ D differs from tau-b in that it uses a correction only for pairs that are tied on the independent variable. Somers’ D is appropriate only when both variables lie on an ordinal scale. The range of Somers’ D is $-1 \leq D \leq 1$ . Somers’ $D(C|R)$ is computed as

$D(C|R) = (P - Q) ~ / ~ w_ r$

and its asymptotic variance is

$\mr{Var}(D(C|R)) = \frac{4}{w_ r^4} \sum _ i \sum _ j n_{ij} \bigl ( w_ r d_{ij} - (P - Q)(n - n_{i \cdot }) \bigr )^2$

where $d_{ij} = A_{ij} - D_{ij}$ and

$w_ r = n^2 - \sum _ i n_{i \cdot }^2$

For more information, see Somers (1962), Goodman and Kruskal (1979), and Liebetrau (1983).

The variance under the null hypothesis that $D(C|R)$ equals zero is computed as

$\mr{Var}_0(D(C|R)) = \frac{4}{w_ r^2} \left( \sum _ i \sum _ j n_{ij} (A_{ij} - D_{ij})^2 - (P-Q)^2/n \right)$

See Brown and Benedetti (1977) for details.

Formulas for Somers’ $D(R|C)$ are obtained by interchanging the indices.

PROC FREQ also provides exact tests for Somers’ $D(C|R)$ and $D(R|C)$ . You can request these tests by specifying the SMDCR and SMDRC options in the EXACT statement. See the section Exact Statistics for more information.
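The following Python sketch illustrates the Somers' D formulas, including the interchange of indices, which is done here by transposing the table. It assumes that $A_{ij}$ and $D_{ij}$ are the total frequencies of cells concordant and discordant with cell (i, j); that notation is not restated in this section. The function names and the example table are hypothetical.

```python
def concordant_discordant(n, i, j):
    """Return (A_ij, D_ij) for cell (i, j) of table n (a list of rows)."""
    a = d = 0
    for k, row in enumerate(n):
        for l, count in enumerate(row):
            direction = (k - i) * (l - j)
            if direction > 0:
                a += count
            elif direction < 0:
                d += count
    return a, d

def somers_d_c_given_r(n):
    """Return (D(C|R), Var, Var0) from the formulas in this section."""
    total = sum(map(sum, n))
    row_tot = [sum(row) for row in n]
    w_r = total ** 2 - sum(t ** 2 for t in row_tot)
    cells = []                           # (n_ij, d_ij, n_i.) for every cell
    for i, row in enumerate(n):
        for j, nij in enumerate(row):
            a, d = concordant_discordant(n, i, j)
            cells.append((nij, a - d, row_tot[i]))
    p_minus_q = sum(nij * dij for nij, dij, _ in cells)
    d_cr = p_minus_q / w_r
    var = 4 / w_r ** 4 * sum(
        nij * (w_r * dij - p_minus_q * (total - ri)) ** 2 for nij, dij, ri in cells
    )
    var0 = 4 / w_r ** 2 * (
        sum(nij * dij ** 2 for nij, dij, _ in cells) - p_minus_q ** 2 / total
    )
    return d_cr, var, var0

def somers_d_r_given_c(n):
    """Interchange the indices by transposing the table."""
    return somers_d_c_given_r([list(col) for col in zip(*n)])

table = [[10, 5], [3, 12]]               # hypothetical 2 x 2 counts
d_cr = somers_d_c_given_r(table)[0]
d_rc = somers_d_r_given_c(table)[0]
```

Because $D(C|R)$ and $D(R|C)$ share the numerator P − Q and have denominators $w_ r$ and $w_ c$ , their product equals the square of tau-b, which is a useful consistency check.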

Pearson Correlation Coefficient

The Pearson correlation coefficient ( $\rho$ ) is computed by using the scores specified in the SCORES= option. This measure is appropriate only when both variables lie on an ordinal scale. The range of the Pearson correlation is $-1 \leq \rho \leq 1$ . The Pearson correlation coefficient is computed as

$r = v / w = \mi{ss}_{rc} / \sqrt {\mi{ss}_ r \mi{ss}_ c}$

and its asymptotic variance is

$\mr{Var}(r) = \frac{1}{w^4} \sum _ i \sum _ j n_{ij} \left( w (R_ i - \bar{R}) (C_ j - \bar{C}) - \frac{b_{ij} v}{2w} \right)^2$

where $R_ i$ and $C_ j$ are the row and column scores and

$\begin{eqnarray*} \mi{ss}_ r & = & \sum _ i \sum _ j n_{ij} (R_ i-\bar{R})^2 \\[0.10in] \mi{ss}_ c & = & \sum _ i \sum _ j n_{ij} (C_ j-\bar{C})^2 \\[0.10in] \mi{ss}_{rc} & = & \sum _ i \sum _ j n_{ij} (R_ i-\bar{R})(C_ j-\bar{C}) \end{eqnarray*}$

$\begin{eqnarray*} b_{ij} & = & (R_ i-\bar{R})^2 \mi{ss}_ c + (C_ j-\bar{C})^2 \mi{ss}_ r \\[0.10in] v & = & \mi{ss}_{rc} \\[0.10in] w & = & \sqrt {\mi{ss}_ r \mi{ss}_ c} \end{eqnarray*}$

See Snedecor and Cochran (1989) for more information.

The SCORES= option in the TABLES statement determines the type of row and column scores used to compute the Pearson correlation (and other score-based statistics). The default is SCORES=TABLE. See the section Scores for details about the available score types and how they are computed.

The variance under the null hypothesis that the correlation equals zero is computed as

$\mr{Var}_0(r) = \left( \sum _ i \sum _ j n_{ij} (R_ i - \bar{R})^2 (C_ j - \bar{C})^2 - \mi{ss}_{rc}^2 / n \right) ~ / ~ \mi{ss}_ r \mi{ss}_ c$

Note that this expression for the variance is derived for multinomial sampling in a contingency table framework, and it differs from the form obtained under the assumption that both variables are continuous and normally distributed. See Brown and Benedetti (1977) for details.

PROC FREQ also provides an exact test for the Pearson correlation coefficient. You can request this test by specifying the PCORR option in the EXACT statement. See the section Exact Statistics for more information.
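As an illustration of the Pearson formulas, the following Python sketch takes the table and the row and column scores as inputs. It assumes that $\bar{R}$ and $\bar{C}$ are the frequency-weighted means of the scores, which this section does not restate; the function name, the scores, and the example table are hypothetical.

```python
from math import sqrt

def pearson_stats(n, row_scores, col_scores):
    """Return (r, Var(r), Var0(r)) for table n and the given scores."""
    total = sum(map(sum, n))
    cells = [(nij, row_scores[i], col_scores[j])
             for i, row in enumerate(n) for j, nij in enumerate(row)]
    r_bar = sum(nij * ri for nij, ri, _ in cells) / total   # assumed weighted mean
    c_bar = sum(nij * cj for nij, _, cj in cells) / total   # assumed weighted mean
    dev = [(nij, ri - r_bar, cj - c_bar) for nij, ri, cj in cells]
    ss_r = sum(nij * dr ** 2 for nij, dr, _ in dev)
    ss_c = sum(nij * dc ** 2 for nij, _, dc in dev)
    ss_rc = sum(nij * dr * dc for nij, dr, dc in dev)
    v, w = ss_rc, sqrt(ss_r * ss_c)
    r = v / w
    # b_ij = (R_i - Rbar)^2 ss_c + (C_j - Cbar)^2 ss_r
    var = sum(
        nij * (w * dr * dc - (dr ** 2 * ss_c + dc ** 2 * ss_r) * v / (2 * w)) ** 2
        for nij, dr, dc in dev
    ) / w ** 4
    var0 = (
        sum(nij * dr ** 2 * dc ** 2 for nij, dr, dc in dev) - ss_rc ** 2 / total
    ) / (ss_r * ss_c)
    return r, var, var0

# Hypothetical counts with scores 1 and 2 for both variables.
r, var, var0 = pearson_stats([[10, 5], [3, 12]], [1, 2], [1, 2])
```

With two levels per variable, the result equals the phi coefficient of the table (approximately 0.471 here), whatever distinct increasing scores are chosen.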

Spearman Rank Correlation Coefficient

The Spearman correlation coefficient ( $\rho _ s$ ) is computed by using rank scores, which are defined in the section Scores. This measure is appropriate only when both variables lie on an ordinal scale. The range of the Spearman correlation is $-1 \leq \rho _ s \leq 1$ . The Spearman correlation coefficient is computed as

$r_ s = v ~ / ~ w$

and its asymptotic variance is

$\mr{Var}(r_ s) = \frac{1}{n^2 w^4} \sum _ i \sum _ j n_{ij} (z_{ij} - \bar{z})^2$

where $R^1_ i$ and $C^1_ j$ are the row and column rank scores and

$\begin{eqnarray*} v & = & \sum _ i \sum _ j n_{ij} R(i) C(j) \\[0.10in] w & = & \frac{1}{12} \sqrt {FG} \\[0.10in] F & = & n^3 - \sum _ i n_{i \cdot }^3 \\[0.10in] G & = & n^3 - \sum _ j n_{\cdot j}^3 \\[0.10in] R(i) & = & R^1_ i - n/2 \\[0.10in] C(j) & = & C^1_ j - n/2 \\[0.10in] \bar{z} & = & \frac{1}{n} \sum _ i \sum _ j n_{ij} z_{ij} \\[0.10in] z_{ij} & = & wv_{ij} - vw_{ij} \end{eqnarray*}$

$\begin{eqnarray*} v_{ij} & = & n \left( R(i) C(j) ~ + ~ \frac{1}{2} \sum _ l n_{il} C(l) ~ + ~ \frac{1}{2} \sum _ k n_{kj} R(k) ~ + \right. \\ & & \left. \hspace{1in} \sum _ l \sum _{k>i} n_{kl} C(l) ~ + ~ \sum _ k \sum _{l>j} n_{kl} R(k) \right) \\[0.10in] w_{ij} & = & \frac{-n}{96w} \left( F n_{\cdot j}^2 + G n_{i \cdot }^2 \right) \end{eqnarray*}$

See Snedecor and Cochran (1989) for more information.

The variance under the null hypothesis that the correlation equals zero is computed as

$\mr{Var}_0(r_ s) = \frac{1}{n^2 w^2} \sum _ i \sum _ j n_{ij} (v_{ij} - \bar{v})^2$

where

$\bar{v} = \sum _ i \sum _ j n_{ij} v_{ij} / n$

Note that the asymptotic variance is derived for multinomial sampling in a contingency table framework, and it differs from the form obtained under the assumption that both variables are continuous and normally distributed. See Brown and Benedetti (1977) for details.

PROC FREQ also provides an exact test for the Spearman correlation coefficient. You can request this test by specifying the SCORR option in the EXACT statement. See the section Exact Statistics for more information.

Polychoric Correlation

When you specify the PLCORR option in the TABLES statement, PROC FREQ computes the polychoric correlation and its standard error. The polychoric correlation is based on the assumption that the two ordinal, categorical variables of the frequency table have an underlying bivariate normal distribution. The polychoric correlation coefficient is the maximum likelihood estimate of the product-moment correlation between the underlying normal variables. The range of the polychoric correlation is from –1 to 1. For $2 \times 2$ tables, the polychoric correlation is also known as the tetrachoric correlation (and it is labeled as such in the displayed output). See Drasgow (1986) for an overview of the polychoric correlation coefficient.

Olsson (1979) gives the likelihood equations and the asymptotic standard errors for estimating the polychoric correlation. The underlying continuous variables relate to the observed crosstabulation table through thresholds, which define a range of numeric values that correspond to each categorical (table) level. PROC FREQ uses Olsson’s maximum likelihood method for simultaneous estimation of the polychoric correlation and the thresholds. (Olsson also presents a two-step method that estimates the thresholds first.)

PROC FREQ iteratively solves the likelihood equations by using a Newton-Raphson algorithm. The initial estimates of the thresholds are computed from the inverse of the normal distribution function at the cumulative marginal proportions of the table. Iterative computation of the polychoric correlation stops when the convergence measure falls below the convergence criterion or when the maximum number of iterations is reached, whichever occurs first. For parameter values that are less than 0.01, the procedure evaluates convergence by using the absolute difference instead of the relative difference. The PLCORR(CONVERGE=) option specifies the convergence criterion, which is 0.0001 by default. The PLCORR(MAXITER=) option specifies the maximum number of iterations, which is 20 by default.
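The starting values for the thresholds can be illustrated with a short Python sketch. It assumes that a variable with K ordered levels has K − 1 thresholds, one at each cumulative marginal proportion below 1, and that no cumulative proportion is exactly 0 (which would have no finite inverse). The function name and the counts are hypothetical; this is not the procedure's code.

```python
from statistics import NormalDist

def initial_thresholds(marginal_counts):
    """Return inverse-normal values at the cumulative marginal proportions."""
    total = sum(marginal_counts)
    standard_normal = NormalDist()
    thresholds = []
    cumulative = 0
    for count in marginal_counts[:-1]:   # the last level has no upper threshold
        cumulative += count
        thresholds.append(standard_normal.inv_cdf(cumulative / total))
    return thresholds

# Hypothetical marginal totals for a three-level variable.
start = initial_thresholds([25, 50, 25])
```

For marginal totals of 25, 50, and 25, the cumulative proportions are 0.25 and 0.75, so the two starting thresholds are symmetric about zero at approximately −0.674 and 0.674.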

If you specify the CL option in the TABLES statement, PROC FREQ provides confidence limits for the polychoric correlation. The confidence limits are computed as

$\hat{\rho } ~ \pm ~ ( ~ z_{\alpha /2} \times \mr{SE}(\hat{\rho }) ~ )$

where $\hat{\rho }$ is the estimate of the polychoric correlation, $z_{\alpha /2}$ is the $100(1 - \alpha /2)$th percentile of the standard normal distribution, and $\mr{SE}(\hat{\rho })$ is the standard error of the polychoric correlation estimate.

If you specify the PLCORR option in the TEST statement, PROC FREQ provides Wald and likelihood ratio tests of the null hypothesis that the polychoric correlation equals 0. The Wald test statistic is computed as

$z = \hat{\rho } ~ / ~ \mr{SE}(\hat{\rho })$

which has a standard normal distribution under the null hypothesis. PROC FREQ computes one-sided and two-sided p-values for the Wald test. When the test statistic z is greater than its null expected value of 0, PROC FREQ displays the right-sided p-value. When the test statistic is less than or equal to 0, PROC FREQ displays the left-sided p-value.

The likelihood ratio statistic for the polychoric correlation is computed as

$G^2 = -2 ~ \ln ( L_0 / L_1 )$

where $L_0$ is the value of the likelihood function (Olsson, 1979) when the polychoric correlation is 0, and $L_1$ is the value of the likelihood function at the maximum (where all parameters are replaced by their maximum likelihood estimates). Under the null hypothesis, the likelihood ratio statistic has an asymptotic chi-square distribution with one degree of freedom.
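Given the two log-likelihood values, the statistic and its p-value can be sketched in Python. The sketch uses the identity that, for one degree of freedom, the chi-square upper tail probability at $G^2$ equals `erfc(sqrt(G2 / 2))`. The function name and the log-likelihood values are hypothetical, and the sketch does not compute the likelihoods themselves.

```python
from math import erfc, sqrt

def likelihood_ratio_test(log_l0, log_l1):
    """Return (G2, p) given ln(L0) and ln(L1)."""
    g2 = -2 * (log_l0 - log_l1)          # equals -2 ln(L0 / L1)
    g2 = max(g2, 0.0)                    # guard against tiny negative rounding
    p = erfc(sqrt(g2 / 2))               # Prob(chi-square with 1 df > G2)
    return g2, p

# Hypothetical log-likelihoods, for illustration only.
g2, p = likelihood_ratio_test(-101.9207295, -100.0)
```

In this example $G^2$ is approximately 3.84, the 95th percentile of the chi-square distribution with one degree of freedom, so the p-value is approximately 0.05.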

Lambda (Asymmetric)

Asymmetric lambda, $\lambda (C|R)$ , is interpreted as the probable improvement in predicting the column variable Y given knowledge of the row variable X. The range of asymmetric lambda is $0 \leq \lambda (C|R) \leq 1$ . Asymmetric lambda ( $C|R$ ) is computed as

$\lambda (C|R) = \frac{\sum _ i r_ i - r}{n - r}$

and its asymptotic variance is

$\mr{Var}(\lambda (C|R)) = \frac{n - \sum _ i r_ i}{(n - r)^3} \left( \sum _ i r_ i + r - 2 \sum _ i (r_ i~ |~ l_ i = l) \right)$

where

$\begin{eqnarray*} r_ i & = & \max _ j (n_{ij}) \\[0.10in] r & = & \max _ j (n_{\cdot j}) \\[0.10in] c_ j & = & \max _ i (n_{ij}) \\[0.10in] c & = & \max _ i (n_{i \cdot }) \end{eqnarray*}$

The values of $l_ i$ and l are determined as follows. Denote by $l_ i$ the unique value of j such that $r_ i=n_{ij}$ , and let l be the unique value of j such that $r=n_{\cdot j}$ . Because of the uniqueness assumptions, ties in the frequencies or in the marginal totals must be broken in an arbitrary but consistent manner. In case of ties, l is defined as the smallest value of j such that $r=n_{\cdot j}$ .

For those columns containing a cell (i, j) for which $n_{ij} = r_ i = c_ j$ , $cs_ j$ records the row in which $c_ j$ is assumed to occur. Initially $cs_ j$ is set equal to –1 for all j. Beginning with i=1, if there is at least one value j such that $n_{ij}=r_ i=c_ j$ , and if $cs_ j = -1$ , then $l_ i$ is defined to be the smallest such value of j, and $cs_ j$ is set equal to i. Otherwise, if $n_{il}=r_ i$ , then $l_ i$ is defined to be equal to l. If neither condition is true, then $l_ i$ is taken to be the smallest value of j such that $n_{ij}=r_ i$ .

The formulas for lambda asymmetric $(R|C)$ can be obtained by interchanging the indices.

See Goodman and Kruskal (1979) for more information.
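A minimal Python sketch of asymmetric lambda follows. It simplifies the tie-breaking rules described above: $l$ and each $l_ i$ are taken as the smallest index that attains the maximum, which matches the rules only when the maxima are unique. It also assumes that more than one column has a nonzero total, so that $n - r > 0$ . The function names and the example table are hypothetical.

```python
def lambda_c_given_r(n):
    """Return (lambda(C|R), Var) for table n, assuming no ties in the maxima."""
    total = sum(map(sum, n))
    col_tot = [sum(col) for col in zip(*n)]
    r_i = [max(row) for row in n]                # r_i = max_j n_ij
    r = max(col_tot)                             # r = max_j n_.j
    l = col_tot.index(r)                         # smallest j with n_.j = r
    l_i = [row.index(max(row)) for row in n]     # simplified tie-breaking
    lam = (sum(r_i) - r) / (total - r)
    tied_sum = sum(ri for ri, li in zip(r_i, l_i) if li == l)
    var = (total - sum(r_i)) / (total - r) ** 3 * (sum(r_i) + r - 2 * tied_sum)
    return lam, var

def lambda_r_given_c(n):
    """Interchange the indices by transposing the table."""
    return lambda_c_given_r([list(col) for col in zip(*n)])

lam, var = lambda_c_given_r([[10, 5], [3, 12]])  # hypothetical counts
```

For this table, the row maxima are 10 and 12 and the largest column total is 17, so lambda(C|R) = (22 − 17) / (30 − 17) = 5/13, approximately 0.385.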

Lambda (Symmetric)

The nondirectional lambda is a weighted average of the two asymmetric lambdas, $\lambda (C|R)$ and $\lambda (R|C)$ , with weights proportional to their denominators $n - r$ and $n - c$ . Its range is $0 \leq \lambda \leq 1$ . Lambda symmetric is computed as

$\lambda = \frac{\sum _ i r_ i + \sum _ j c_ j - r - c}{2n - r - c} = \frac{w - v}{w}$

and its asymptotic variance is computed as

$\mr{Var}(\lambda ) = \frac{1}{w^4} \Bigl ( wvy - 2w^2 \bigl ( n-\sum _ i \sum _ j (n_{ij}~ |~ j=l_ i,i=k_ j) \bigr ) - 2v^2 (n - n_{kl}) \Bigr )$

where

$\begin{eqnarray*} r_ i & = & \max _ j (n_{ij}) \\[0.10in] r & = & \max _ j (n_{\cdot j}) \\[0.10in] c_ j & = & \max _ i (n_{ij}) \\[0.10in] c & = & \max _ i (n_{i \cdot }) \\[0.10in] w & = & 2n - r - c \\[0.10in] v & = & 2n - \sum _ i r_ i - \sum _ j c_ j \\[0.10in] x & = & \sum _ i (r_ i ~ |~ l_ i=l ) ~ + ~ \sum _ j (c_ j ~ |~ k_ j=k) ~ + ~ r_ k ~ + ~ c_ l \\[0.10in] y & = & 8n - w - v - 2x \end{eqnarray*}$

The definitions of $l_ i$ and l are given in the previous section. The values $k_ j$ and k are defined in a similar way for lambda asymmetric ( $R|C$ ).

See Goodman and Kruskal (1979) for more information.
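The point estimate of symmetric lambda can be sketched in Python as follows. The sketch covers only the estimate, not the variance, whose index sets depend on the tie-breaking rules. It assumes $2n - r - c > 0$ ; the function name and the example table are hypothetical.

```python
def lambda_symmetric(n):
    """Return symmetric lambda, computed as (w - v) / w."""
    total = sum(map(sum, n))
    row_tot = [sum(row) for row in n]
    col_tot = [sum(col) for col in zip(*n)]
    sum_r_i = sum(max(row) for row in n)         # sum of row maxima
    sum_c_j = sum(max(col) for col in zip(*n))   # sum of column maxima
    r = max(col_tot)                             # largest column total
    c = max(row_tot)                             # largest row total
    w = 2 * total - r - c
    v = 2 * total - sum_r_i - sum_c_j
    return (w - v) / w

lam = lambda_symmetric([[10, 5], [3, 12]])       # hypothetical counts
```

For this table, w = 60 − 17 − 15 = 28 and v = 60 − 22 − 22 = 16, so lambda = 12/28, approximately 0.429.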

Uncertainty Coefficients (Asymmetric)

The uncertainty coefficient $U(C|R)$ measures the proportion of uncertainty (entropy) in the column variable Y that is explained by the row variable X. Its range is $0 \leq U(C|R) \leq 1$ . The uncertainty coefficient is computed as

$U(C|R) = \left( H(X) + H(Y) - H(XY) \right) ~ / ~ H(Y) = v / w$

and its asymptotic variance is

$\mr{Var}(U(C|R)) = \frac{1}{n^2 w^4} \sum _ i \sum _ j n_{ij} \bigl ( H(Y) \ln \left( \frac{n_{ij}}{n_{i \cdot }} \right) + (H(X) - H(XY)) \ln \left( \frac{n_{\cdot j}}{n} \right) \bigr )^2$

where

$\begin{eqnarray*} v & = & H(X) + H(Y) - H(XY) \\[0.10in] w & = & H(Y) \\[0.10in] H(X) & = & -\sum _ i \left( \frac{n_{i \cdot }}{n} \right) \ln \left( \frac{n_{i \cdot }}{n} \right) \\[0.10in] H(Y) & = & -\sum _ j \left( \frac{n_{\cdot j}}{n} \right) \ln \left( \frac{n_{\cdot j}}{n} \right) \\[0.10in] H(XY) & = & -\sum _ i \sum _ j \left( \frac{n_{ij}}{n} \right) \ln \left( \frac{n_{ij}}{n} \right) \end{eqnarray*}$

The formulas for the uncertainty coefficient $U(R|C)$ can be obtained by interchanging the indices.

See Theil (1972, pp. 115–120) and Goodman and Kruskal (1979) for more information.
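The entropy terms and the asymmetric uncertainty coefficient can be sketched in Python. The sketch assumes the usual convention that a zero frequency contributes nothing to an entropy sum (0 ln 0 is taken as 0) and that the column variable has at least two nonempty levels, so that $H(Y) > 0$ . The function names and the example tables are hypothetical.

```python
from math import log

def entropy(counts, total):
    """Return -sum (c/n) ln(c/n), skipping zero counts."""
    return -sum(c / total * log(c / total) for c in counts if c > 0)

def uncertainty_c_given_r(n):
    """Return (U(C|R), Var(U(C|R))) from the formulas in this section."""
    total = sum(map(sum, n))
    row_tot = [sum(row) for row in n]
    col_tot = [sum(col) for col in zip(*n)]
    hx = entropy(row_tot, total)
    hy = entropy(col_tot, total)
    hxy = entropy([c for row in n for c in row], total)
    u = (hx + hy - hxy) / hy
    # Cells with n_ij = 0 contribute nothing because each term is weighted by n_ij.
    var = sum(
        nij * (hy * log(nij / row_tot[i]) + (hx - hxy) * log(col_tot[j] / total)) ** 2
        for i, row in enumerate(n) for j, nij in enumerate(row) if nij > 0
    ) / (total ** 2 * hy ** 4)
    return u, var

def uncertainty_r_given_c(n):
    """Interchange the indices by transposing the table."""
    return uncertainty_c_given_r([list(col) for col in zip(*n)])

u, var = uncertainty_c_given_r([[10, 5], [3, 12]])   # hypothetical counts
```

Two limiting cases are easy to verify: a table whose cell counts are exactly proportional to the products of its margins gives U(C|R) = 0, and a table in which each row has a single nonzero cell in a distinct column gives U(C|R) = 1.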

Uncertainty Coefficient (Symmetric)

The uncertainty coefficient U is the symmetric version of the two asymmetric uncertainty coefficients. Its range is $0 \leq U \leq 1$ . The uncertainty coefficient is computed as

$U = 2 \left( H(X) + H(Y) - H(XY) \right) ~ / ~ ( H(X) + H(Y) )$

and its asymptotic variance is

$\mr{Var}(U) = 4 \sum _ i \sum _ j \frac{ n_{ij} \left( H(XY) \ln \left( \frac{n_{i \cdot } n_{\cdot j}}{n^2} \right) - (H(X) + H(Y)) \ln \left( \frac{n_{ij}}{n} \right) \right)^2 }{n^2 ~ (H(X) + H(Y))^4}$

where $H(X)$ , $H(Y)$ , and $H(XY)$ are defined in the previous section. See Goodman and Kruskal (1979) for more information.
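The symmetric coefficient and its variance can be sketched in the same way. The Python below assumes that zero frequencies contribute nothing to the entropy sums and that $H(X) + H(Y) > 0$ ; the function names and the example table are hypothetical.

```python
from math import log

def entropy(counts, total):
    """Return -sum (c/n) ln(c/n), skipping zero counts."""
    return -sum(c / total * log(c / total) for c in counts if c > 0)

def uncertainty_symmetric(n):
    """Return (U, Var(U)) from the formulas in this section."""
    total = sum(map(sum, n))
    row_tot = [sum(row) for row in n]
    col_tot = [sum(col) for col in zip(*n)]
    hx = entropy(row_tot, total)
    hy = entropy(col_tot, total)
    hxy = entropy([c for row in n for c in row], total)
    u = 2 * (hx + hy - hxy) / (hx + hy)
    # Cells with n_ij = 0 contribute nothing because each term is weighted by n_ij.
    numerator = sum(
        nij * (
            hxy * log(row_tot[i] * col_tot[j] / total ** 2)
            - (hx + hy) * log(nij / total)
        ) ** 2
        for i, row in enumerate(n) for j, nij in enumerate(row) if nij > 0
    )
    var = 4 * numerator / (total ** 2 * (hx + hy) ** 4)
    return u, var

u, var = uncertainty_symmetric([[10, 5], [3, 12]])   # hypothetical counts
```

As with the asymmetric coefficients, U is 0 when the cell counts are exactly proportional to the products of the margins, and U is 1 when each row has a single nonzero cell in a distinct column.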