| The FREQ Procedure |
| Statistical Computations |
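A two-way table represents the crosstabulation of row variable X and column variable Y. Let the table row values or levels be denoted by $X_i$, $i = 1, 2, \ldots, R$, and the column values by $Y_j$, $j = 1, 2, \ldots, C$. Let $n_{ij}$ denote the frequency of the table cell in the $i$th row and $j$th column and define the following notation:
$$
\begin{aligned}
n_{i\cdot} &= \sum_j n_{ij} && \text{(row totals)} \\
n_{\cdot j} &= \sum_i n_{ij} && \text{(column totals)} \\
n &= \sum_i \sum_j n_{ij} && \text{(overall total)} \\
p_{ij} &= n_{ij} / n && \text{(cell proportions)} \\
p_{i\cdot} &= n_{i\cdot} / n && \text{(row proportions of total)} \\
p_{\cdot j} &= n_{\cdot j} / n && \text{(column proportions of total)} \\
A_{ij} &= \sum_{k>i} \sum_{l>j} n_{kl} + \sum_{k<i} \sum_{l<j} n_{kl} \\
D_{ij} &= \sum_{k>i} \sum_{l<j} n_{kl} + \sum_{k<i} \sum_{l>j} n_{kl} \\
P &= \sum_i \sum_j n_{ij} A_{ij} && \text{(twice the number of concordances)} \\
Q &= \sum_i \sum_j n_{ij} D_{ij} && \text{(twice the number of discordances)}
\end{aligned}
$$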
PROC FREQ uses scores of the variable values to compute the Mantel-Haenszel chi-square, Pearson correlation, Cochran-Armitage test for trend, weighted kappa coefficient, and Cochran-Mantel-Haenszel statistics. The SCORES= option in the TABLES statement specifies the score type that PROC FREQ uses. The available score types are TABLE, RANK, RIDIT, and MODRIDIT scores. The default score type is TABLE. Using MODRIDIT, RANK, or RIDIT scores yields nonparametric analyses.
For numeric variables, table scores are the values of the row and column levels. If the row or column variable is formatted, then the table score is the internal numeric value corresponding to that level. If two or more numeric values are classified into the same formatted level, then the internal numeric value for that level is the smallest of these values. For character variables, table scores are defined as the row numbers and column numbers (that is, 1 for the first row, 2 for the second row, and so on).
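Rank scores, which you request with the SCORES=RANK option, are defined as
$$R1_i = \sum_{k<i} n_{k\cdot} + (n_{i\cdot} + 1)/2, \qquad i = 1, 2, \ldots, R$$
$$C1_j = \sum_{l<j} n_{\cdot l} + (n_{\cdot j} + 1)/2, \qquad j = 1, 2, \ldots, C$$
where $R1_i$ is the rank score of row $i$, and $C1_j$ is the rank score of column $j$. Note that rank scores yield midranks for tied values.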
Ridit scores, which you request with the SCORES=RIDIT option, are defined as rank scores standardized by the sample size (Bross 1958, Mack and Skillings 1980). Ridit scores are derived from the rank scores as
$$R2_i = R1_i / n, \qquad C2_j = C1_j / n$$
Modified ridit scores (SCORES=MODRIDIT) represent the expected values of the order statistics of the uniform distribution on (0,1) (van Elteren 1960, Lehmann 1975). Modified ridit scores are derived from rank scores as
$$R3_i = R1_i / (n + 1), \qquad C3_j = C1_j / (n + 1)$$
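The three rank-based score types can be sketched in a few lines. The following Python fragment is an illustration only, not PROC FREQ code; the function names are hypothetical, and each function takes the marginal totals of one variable (row totals or column totals). It assumes a rank score is the count of observations in lower levels plus the midrank within the level, that ridit scores divide the rank scores by $n$, and that modified ridit scores divide them by $n + 1$.

```python
def rank_scores(totals):
    """Midrank score for each level, given that level's marginal total."""
    scores = []
    below = 0  # number of observations in all lower levels
    for t in totals:
        scores.append(below + (t + 1) / 2)
        below += t
    return scores


def ridit_scores(totals):
    """Rank scores standardized by the sample size n."""
    n = sum(totals)
    return [s / n for s in rank_scores(totals)]


def modridit_scores(totals):
    """Rank scores divided by n + 1."""
    n = sum(totals)
    return [s / (n + 1) for s in rank_scores(totals)]
```

For marginal totals 2, 4, 4 the observations occupy ranks 1 to 2, 3 to 6, and 7 to 10, so `rank_scores([2, 4, 4])` returns the midranks 1.5, 4.5, and 8.5.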
The CHISQ option provides chi-square tests of homogeneity or independence and measures of association based on the chi-square statistic. When you specify the CHISQ option in the TABLES statement, PROC FREQ computes the following chi-square tests for each two-way table: the Pearson chi-square, likelihood-ratio chi-square, and Mantel-Haenszel chi-square. PROC FREQ provides the following measures of association based on the Pearson chi-square statistic: the phi coefficient, contingency coefficient, and Cramer’s $V$. For $2 \times 2$ tables, the CHISQ option also provides Fisher’s exact test and the continuity-adjusted chi-square. You can request Fisher’s exact test for general $R \times C$ tables by specifying the FISHER option in the TABLES or EXACT statement.
For one-way frequency tables, the CHISQ option provides a chi-square goodness-of-fit test. The other chi-square tests and statistics described in this section are computed only for two-way tables.
All of the two-way test statistics described in this section test the null hypothesis of no association between the row variable and the column variable. When the sample size $n$ is large, these test statistics have an asymptotic chi-square distribution when the null hypothesis is true. When the sample size is not large, exact tests might be useful. PROC FREQ provides exact tests for the Pearson chi-square, the likelihood-ratio chi-square, and the Mantel-Haenszel chi-square (in addition to Fisher’s exact test). PROC FREQ also provides an exact chi-square goodness-of-fit test for one-way tables. You can request these exact tests by specifying the corresponding options in the EXACT statement. See the section Exact Statistics for more information.
Note that the Mantel-Haenszel chi-square statistic is appropriate only when both variables lie on an ordinal scale. The other chi-square tests and statistics in this section are appropriate for either nominal or ordinal variables. The following sections give the formulas that PROC FREQ uses to compute the chi-square tests and statistics. See Agresti (2007), Stokes, Davis, and Koch (2000), and the other references cited for each statistic for more information.
For one-way frequency tables, the CHISQ option in the TABLES statement provides a chi-square goodness-of-fit test. Let $C$ denote the number of classes, or levels, in the one-way table. Let $f_i$ denote the frequency of class $i$ (or the number of observations in class $i$) for $i = 1, 2, \ldots, C$. Then PROC FREQ computes the one-way chi-square statistic as
$$Q_P = \sum_{i=1}^{C} \frac{(f_i - e_i)^2}{e_i}$$
where $e_i$ is the expected frequency for class $i$ under the null hypothesis.
In the test for equal proportions, which is the default for the CHISQ option, the null hypothesis specifies equal proportions of the total sample size for each class. Under this null hypothesis, the expected frequency for each class equals the total sample size divided by the number of classes,
$$e_i = n / C \qquad \text{for } i = 1, 2, \ldots, C$$
In the test for specified frequencies, which PROC FREQ computes when you input null hypothesis frequencies by using the TESTF= option, the expected frequencies are the TESTF= values that you specify. In the test for specified proportions, which PROC FREQ computes when you input null hypothesis proportions by using the TESTP= option, the expected frequencies are determined from the specified TESTP= proportions $p_i$ as
$$e_i = p_i \times n \qquad \text{for } i = 1, 2, \ldots, C$$
Under the null hypothesis (of equal proportions, specified frequencies, or specified proportions), $Q_P$ has an asymptotic chi-square distribution with $C - 1$ degrees of freedom.
In addition to the asymptotic test, you can request an exact one-way chi-square test by specifying the CHISQ option in the EXACT statement. See the section Exact Statistics for more information.
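As a rough illustration of the one-way computation, the following Python sketch sums $(f_i - e_i)^2 / e_i$ over the classes. It is not PROC FREQ code: the function name and the `test_props` argument are hypothetical stand-ins for the default equal-proportions test and for TESTP= values given as proportions that sum to 1.

```python
def one_way_chi_square(freqs, test_props=None):
    """Return (chi-square statistic, degrees of freedom) for a one-way table."""
    n = sum(freqs)
    c = len(freqs)
    if test_props is None:
        # Equal proportions: every class expects n / C observations.
        expected = [n / c] * c
    else:
        # Specified proportions (assumed to sum to 1): e_i = p_i * n.
        expected = [p * n for p in test_props]
    q_p = sum((f - e) ** 2 / e for f, e in zip(freqs, expected))
    return q_p, c - 1
```

With frequencies 10, 20, 30 and no specified proportions, each expected frequency is 20, so the statistic is $(100 + 0 + 100)/20 = 10$ with 2 degrees of freedom.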
The Pearson chi-square for two-way tables involves the differences between the observed and expected frequencies, where the expected frequencies are computed under the null hypothesis of independence. The Pearson chi-square statistic is computed as
$$Q_P = \sum_i \sum_j \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$
where $n_{ij}$ is the observed frequency in table cell ($i, j$) and $e_{ij}$ is the expected frequency for table cell ($i, j$). The expected frequency is computed under the null hypothesis that the row and column variables are independent,
$$e_{ij} = \frac{n_{i\cdot} \, n_{\cdot j}}{n}$$
When the row and column variables are independent, $Q_P$ has an asymptotic chi-square distribution with $(R-1)(C-1)$ degrees of freedom. For large values of $Q_P$, this test rejects the null hypothesis in favor of the alternative hypothesis of general association.
In addition to the asymptotic test, you can request an exact Pearson chi-square test by specifying the PCHI or CHISQ option in the EXACT statement. See the section Exact Statistics for more information.
For $2 \times 2$ tables, the Pearson chi-square is also appropriate for testing the equality of two binomial proportions. For $R \times 2$ and $2 \times C$ tables, the Pearson chi-square tests the homogeneity of proportions. See Fienberg (1980) for details.
The likelihood-ratio chi-square involves the ratios between the observed and expected frequencies. The likelihood-ratio chi-square statistic is computed as
$$G^2 = 2 \sum_i \sum_j n_{ij} \ln\!\left( \frac{n_{ij}}{e_{ij}} \right)$$
where $n_{ij}$ is the observed frequency in table cell ($i, j$) and $e_{ij}$ is the expected frequency for table cell ($i, j$).
When the row and column variables are independent, $G^2$ has an asymptotic chi-square distribution with $(R-1)(C-1)$ degrees of freedom.
In addition to the asymptotic test, you can request an exact likelihood-ratio chi-square test by specifying the LRCHI or CHISQ option in the EXACT statement. See the section Exact Statistics for more information.
The continuity-adjusted chi-square for $2 \times 2$ tables is similar to the Pearson chi-square, but it is adjusted for the continuity of the chi-square distribution. The continuity-adjusted chi-square is most useful for small sample sizes. The use of the continuity adjustment is somewhat controversial; this chi-square test is more conservative (and more like Fisher’s exact test) when the sample size is small. As the sample size increases, the continuity-adjusted chi-square becomes more like the Pearson chi-square.
The continuity-adjusted chi-square statistic is computed as
$$Q_C = \sum_i \sum_j \frac{\left( \max(0, \, |n_{ij} - e_{ij}| - 0.5) \right)^2}{e_{ij}}$$
Under the null hypothesis of independence, $Q_C$ has an asymptotic chi-square distribution with $(R-1)(C-1)$ degrees of freedom.
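The Pearson, likelihood-ratio, and continuity-adjusted statistics all compare observed counts with the expected counts $n_{i\cdot} n_{\cdot j} / n$, so they can be sketched in one pass over the table. The Python fragment below is an illustration, not PROC FREQ code. The function name is hypothetical, every marginal total is assumed to be positive, and treating an empty cell as contributing zero to the likelihood-ratio sum is an assumption of this sketch.

```python
import math


def chi_square_stats(table):
    """Return (Q_P, G2, Q_C, df) for a two-way table given as a list of rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    q_p = g2 = q_c = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n  # expected count under independence
            q_p += (n_ij - e_ij) ** 2 / e_ij
            if n_ij > 0:  # assumption: an empty cell adds 0 to G2
                g2 += 2 * n_ij * math.log(n_ij / e_ij)
            q_c += max(0.0, abs(n_ij - e_ij) - 0.5) ** 2 / e_ij
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return q_p, g2, q_c, df
```

For the table with rows (10, 20) and (20, 10), every expected count is 15, so the Pearson statistic is $4 \times 25 / 15 \approx 6.67$ and the continuity-adjusted statistic is $4 \times 4.5^2 / 15 = 5.4$, each with 1 degree of freedom.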
The Mantel-Haenszel chi-square statistic tests the alternative hypothesis that there is a linear association between the row variable and the column variable. Both variables must lie on an ordinal scale. The Mantel-Haenszel chi-square statistic is computed as
$$Q_{MH} = (n - 1) \, r^2$$
where $r$ is the Pearson correlation between the row variable and the column variable. For a description of the Pearson correlation, see the section Pearson Correlation Coefficient. The Pearson correlation and thus the Mantel-Haenszel chi-square statistic use the scores that you specify in the SCORES= option in the TABLES statement. See Mantel and Haenszel (1959) and Landis, Heyman, and Koch (1978) for more information.
Under the null hypothesis of no association, $Q_{MH}$ has an asymptotic chi-square distribution with one degree of freedom.
In addition to the asymptotic test, you can request an exact Mantel-Haenszel chi-square test by specifying the MHCHI or CHISQ option in the EXACT statement. See the section Exact Statistics for more information.
Fisher’s exact test is another test of association between the row and column variables. This test assumes that the row and column totals are fixed, and then uses the hypergeometric distribution to compute probabilities of possible tables conditional on the observed row and column totals. Fisher’s exact test does not depend on any large-sample distribution assumptions, and so it is appropriate even for small sample sizes and for sparse tables.
$2 \times 2$ Tables
For $2 \times 2$ tables, PROC FREQ gives the following information for Fisher’s exact test: table probability, two-sided $p$-value, left-sided $p$-value, and right-sided $p$-value. The table probability equals the hypergeometric probability of the observed table, and is in fact the value of the test statistic for Fisher’s exact test.
Where $p$ is the hypergeometric probability of a specific table with the observed row and column totals, Fisher’s exact $p$-values are computed by summing probabilities $p$ over defined sets of tables,
$$\mathrm{PROB} = \sum_{A} p$$
The two-sided $p$-value is the sum of all possible table probabilities (conditional on the observed row and column totals) that are less than or equal to the observed table probability. For the two-sided $p$-value, the set $A$ includes all possible tables with hypergeometric probabilities less than or equal to the probability of the observed table. A small two-sided $p$-value supports the alternative hypothesis of association between the row and column variables.
For $2 \times 2$ tables, one-sided $p$-values for Fisher’s exact test are defined in terms of the frequency of the cell in the first row and first column of the table, the (1,1) cell. Denoting the observed (1,1) cell frequency by $n_{11}$, the left-sided $p$-value for Fisher’s exact test is the probability that the (1,1) cell frequency is less than or equal to $n_{11}$. For the left-sided $p$-value, the set $A$ includes those tables with a (1,1) cell frequency less than or equal to $n_{11}$. A small left-sided $p$-value supports the alternative hypothesis that the probability of an observation being in the first cell is actually less than expected under the null hypothesis of independent row and column variables.
Similarly, for a right-sided alternative hypothesis, $A$ is the set of tables where the frequency of the (1,1) cell is greater than or equal to that in the observed table. A small right-sided $p$-value supports the alternative that the probability of the first cell is actually greater than that expected under the null hypothesis.
Because the (1,1) cell frequency completely determines the $2 \times 2$ table when the marginal row and column sums are fixed, these one-sided alternatives can be stated equivalently in terms of other cell probabilities or ratios of cell probabilities. The left-sided alternative is equivalent to an odds ratio less than 1, where the odds ratio equals ($n_{11} n_{22} / n_{12} n_{21}$). Additionally, the left-sided alternative is equivalent to the column 1 risk for row 1 being less than the column 1 risk for row 2, $p_{1|1} < p_{1|2}$. Similarly, the right-sided alternative is equivalent to the column 1 risk for row 1 being greater than the column 1 risk for row 2, $p_{1|1} > p_{1|2}$. See Agresti (2007) for details.
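For a $2 \times 2$ table, the sets of tables described above can be enumerated directly, because fixing the margins leaves only the (1,1) cell free. The Python sketch below is an illustration under that assumption, not the algorithm that PROC FREQ uses. The function name is hypothetical, and exact fractions are used so that the "less than or equal to the observed probability" comparison is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import comb


def fisher_2x2(n11, n12, n21, n22):
    """Table probability and left-, right-, two-sided p-values for a 2x2 table."""
    r1, r2 = n11 + n12, n21 + n22  # row totals
    c1 = n11 + n21                 # first column total
    n = r1 + r2

    def prob(k):
        # Hypergeometric probability that the (1,1) cell equals k.
        return Fraction(comb(r1, k) * comb(r2, c1 - k), comb(n, c1))

    lo, hi = max(0, c1 - r2), min(r1, c1)  # feasible (1,1) frequencies
    probs = {k: prob(k) for k in range(lo, hi + 1)}
    p_obs = probs[n11]
    return {
        "table": p_obs,
        "left": sum(p for k, p in probs.items() if k <= n11),
        "right": sum(p for k, p in probs.items() if k >= n11),
        "two_sided": sum(p for p in probs.values() if p <= p_obs),
    }
```

For the table with rows (3, 1) and (1, 3), the five feasible tables have probabilities 1/70, 16/70, 36/70, 16/70, and 1/70, so the observed table has probability 16/70, the left-sided value is 69/70, the right-sided value is 17/70, and the two-sided value is 34/70.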
$R \times C$ Tables
Fisher’s exact test was extended to general $R \times C$ tables by Freeman and Halton (1951), and this test is also known as the Freeman-Halton test. For $R \times C$ tables, the two-sided $p$-value definition is the same as for $2 \times 2$ tables. The set $A$ contains all tables with $p$ less than or equal to the probability of the observed table. A small $p$-value supports the alternative hypothesis of association between the row and column variables. For $R \times C$ tables, Fisher’s exact test is inherently two-sided. The alternative hypothesis is defined only in terms of general, and not linear, association. Therefore, Fisher’s exact test does not have right-sided or left-sided $p$-values for general $R \times C$ tables.
For $R \times C$ tables, PROC FREQ computes Fisher’s exact test by using the network algorithm of Mehta and Patel (1983), which provides a faster and more efficient solution than direct enumeration. See the section Exact Statistics for more details.
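The phi coefficient is a measure of association derived from the Pearson chi-square. The range of the phi coefficient is $-1 \le \phi \le 1$ for $2 \times 2$ tables. For tables larger than $2 \times 2$, the range is $0 \le \phi \le \min(\sqrt{R-1}, \sqrt{C-1})$ (Liebetrau 1983). The phi coefficient is computed as
$$\phi = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{1\cdot} \, n_{2\cdot} \, n_{\cdot 1} \, n_{\cdot 2}}} \qquad \text{for } 2 \times 2 \text{ tables}$$
$$\phi = \sqrt{Q_P / n} \qquad \text{otherwise}$$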
See Fleiss, Levin, and Paik (2003, pp. 98–99) for more information.
The contingency coefficient is a measure of association derived from the Pearson chi-square. The range of the contingency coefficient is $0 \le P \le \sqrt{(m-1)/m}$, where $m = \min(R, C)$ (Liebetrau 1983). The contingency coefficient is computed as
$$P = \sqrt{\frac{Q_P}{Q_P + n}}$$
See Kendall and Stuart (1979, pp. 587–588) for more information.
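Cramer’s $V$ is a measure of association derived from the Pearson chi-square. It is designed so that the attainable upper bound is always 1. The range of Cramer’s $V$ is $-1 \le V \le 1$ for $2 \times 2$ tables; for tables larger than $2 \times 2$, the range is $0 \le V \le 1$. Cramer’s $V$ is computed as
$$V = \phi \qquad \text{for } 2 \times 2 \text{ tables}$$
$$V = \sqrt{\frac{Q_P / n}{\min(R-1, \, C-1)}} \qquad \text{otherwise}$$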
See Kendall and Stuart (1979, p. 588) for more information.
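The three chi-square-based measures differ only in how they rescale the Pearson statistic. The following Python sketch is an illustration with a hypothetical function name. It assumes the signed cross-product form of phi for $2 \times 2$ tables, $\sqrt{Q_P / n}$ for larger tables, the contingency coefficient $\sqrt{Q_P / (Q_P + n)}$, and Cramer’s $V$ as phi for $2 \times 2$ tables or $\sqrt{(Q_P / n) / \min(R-1, C-1)}$ otherwise.

```python
import math


def association_measures(table):
    """Return (phi, contingency coefficient, Cramer's V) for a two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # Pearson chi-square with expected counts (row total * column total) / n.
    q_p = sum(
        (n_ij - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, n_ij in enumerate(row)
    )
    if n_rows == 2 and n_cols == 2:
        (a, b), (c, d) = table
        phi = (a * d - b * c) / math.sqrt(
            row_tot[0] * row_tot[1] * col_tot[0] * col_tot[1]
        )
        cramer_v = phi  # signed for 2x2 tables
    else:
        phi = math.sqrt(q_p / n)
        cramer_v = math.sqrt(q_p / n / min(n_rows - 1, n_cols - 1))
    contingency = math.sqrt(q_p / (q_p + n))
    return phi, contingency, cramer_v
```

A $3 \times 3$ table with 10 observations in each diagonal cell and none elsewhere shows the upper bounds: phi is $\sqrt{2}$, the contingency coefficient is $\sqrt{2/3}$, and Cramer’s $V$ is 1.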
When you specify the MEASURES option in the TABLES statement, PROC FREQ computes several statistics that describe the association between the row and column variables of the contingency table. The following are measures of ordinal association that consider whether the column variable Y tends to increase as the row variable X increases: gamma, Kendall’s tau-$b$, Stuart’s tau-$c$, and Somers’ $D$. These measures are appropriate for ordinal variables, and they classify pairs of observations as concordant or discordant. A pair is concordant if the observation with the larger value of X also has the larger value of Y. A pair is discordant if the observation with the larger value of X has the smaller value of Y. See Agresti (2007) and the other references cited for the individual measures of association.
The Pearson correlation coefficient and the Spearman rank correlation coefficient are also appropriate for ordinal variables. The Pearson correlation describes the strength of the linear association between the row and column variables, and it is computed by using the row and column scores specified by the SCORES= option in the TABLES statement. The Spearman correlation is computed with rank scores. The polychoric correlation (requested by the PLCORR option) also requires ordinal variables and assumes that the variables have an underlying bivariate normal distribution. The following measures of association do not require ordinal variables and are appropriate for nominal variables: lambda asymmetric, lambda symmetric, and the uncertainty coefficients.
PROC FREQ computes estimates of the measures according to the formulas given in the following sections. For each measure, PROC FREQ computes an asymptotic standard error (ASE), which is the square root of the asymptotic variance denoted by $\mathrm{var}$ in the following sections.
If you specify the CL option in the TABLES statement, PROC FREQ computes asymptotic confidence limits for all MEASURES statistics. The confidence coefficient is determined according to the value of the ALPHA= option, which, by default, equals 0.05 and produces 95% confidence limits.
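The confidence limits are computed as
$$\mathrm{est} \pm ( z_{\alpha/2} \times \mathrm{ASE} )$$
where $\mathrm{est}$ is the estimate of the measure, $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution, and ASE is the asymptotic standard error of the estimate.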
For each measure that you specify in the TEST statement, PROC FREQ computes an asymptotic test of the null hypothesis that the measure equals zero. Asymptotic tests are available for the following measures of association: gamma, Kendall’s tau-$b$, Stuart’s tau-$c$, Somers’ $D(C|R)$, Somers’ $D(R|C)$, the Pearson correlation coefficient, and the Spearman rank correlation coefficient. To compute an asymptotic test, PROC FREQ uses a standardized test statistic $z$, which has an asymptotic standard normal distribution under the null hypothesis. The test statistic is computed as
$$z = \frac{\mathrm{est}}{\sqrt{\mathrm{var}_0(\mathrm{est})}}$$
where $\mathrm{est}$ is the estimate of the measure and $\mathrm{var}_0(\mathrm{est})$ is the variance of the estimate under the null hypothesis. Formulas for $\mathrm{var}_0(\mathrm{est})$ for the individual measures of association are given in the following sections.
Note that the ratio of $\mathrm{est}$ to $\sqrt{\mathrm{var}_0(\mathrm{est})}$ is the same for the following measures: gamma, Kendall’s tau-$b$, Stuart’s tau-$c$, Somers’ $D(C|R)$, and Somers’ $D(R|C)$. Therefore, the tests for these measures are identical. For example, the $p$-values for the test of $H_0\colon \text{gamma} = 0$ equal the $p$-values for the test of $H_0\colon \text{tau-}b = 0$.
PROC FREQ computes one-sided and two-sided $p$-values for each of these tests. When the test statistic $z$ is greater than its null hypothesis expected value of zero, PROC FREQ displays the right-sided $p$-value, which is the probability of a larger value of the statistic occurring under the null hypothesis. A small right-sided $p$-value supports the alternative hypothesis that the true value of the measure is greater than zero. When the test statistic is less than or equal to zero, PROC FREQ displays the left-sided $p$-value, which is the probability of a smaller value of the statistic occurring under the null hypothesis. A small left-sided $p$-value supports the alternative hypothesis that the true value of the measure is less than zero. The one-sided $p$-value $P_1$ can be expressed as
$$P_1 = \begin{cases} \mathrm{Prob}(Z > z) & \text{if } z > 0 \\ \mathrm{Prob}(Z < z) & \text{if } z \le 0 \end{cases}$$
where $Z$ has a standard normal distribution. The two-sided $p$-value $P_2$ is computed as
$$P_2 = \mathrm{Prob}(|Z| > |z|)$$
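The confidence limits and the asymptotic test use only an estimate, its standard error or null variance, and the standard normal distribution. The Python sketch below illustrates both steps with the standard library's `statistics.NormalDist`. The function names are hypothetical. The sketch assumes limits of the form estimate plus or minus a normal percentile times the ASE, and a test statistic equal to the estimate divided by the square root of its null variance.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def confidence_limits(est, ase, alpha=0.05):
    """Asymptotic limits: est -/+ z * ASE, z the 100(1 - alpha/2)th percentile."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return est - z * ase, est + z * ase


def asymptotic_test(est, var0):
    """Return (z, one-sided p-value, two-sided p-value) for H0: measure = 0."""
    z = est / var0 ** 0.5
    cdf = STD_NORMAL.cdf(z)
    # Right-sided when z > 0, left-sided otherwise.
    one_sided = 1 - cdf if z > 0 else cdf
    two_sided = 2 * min(cdf, 1 - cdf)
    return z, one_sided, two_sided
```

For example, an estimate of 0.3 with a null variance of 0.0225 gives $z = 2$, a right-sided $p$-value of about 0.0228, and a two-sided $p$-value of about 0.0455.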
Exact tests are available for two measures of association: the Pearson correlation coefficient and the Spearman rank correlation coefficient. If you specify the PCORR option in the EXACT statement, PROC FREQ computes the exact test of the hypothesis that the Pearson correlation equals zero. If you specify the SCORR option in the EXACT statement, PROC FREQ computes the exact test of the hypothesis that the Spearman correlation equals zero. See the section Exact Statistics for more information.
The gamma ($\Gamma$) statistic is based only on the number of concordant and discordant pairs of observations. It ignores tied pairs (that is, pairs of observations that have equal values of $X$ or equal values of $Y$). Gamma is appropriate only when both variables lie on an ordinal scale. The range of gamma is $-1 \le \Gamma \le 1$. If the row and column variables are independent, then gamma tends to be close to zero. Gamma is estimated by
$$G = \frac{P - Q}{P + Q}$$
and the asymptotic variance is
$$\mathrm{var}(G) = \frac{16}{(P + Q)^4} \sum_i \sum_j n_{ij} \left( Q A_{ij} - P D_{ij} \right)^2$$
For $2 \times 2$ tables, gamma is equivalent to Yule’s $Q$. See Goodman and Kruskal (1979) and Agresti (2002) for more information.
The variance under the null hypothesis that gamma equals zero is computed as
$$\mathrm{var}_0(G) = \frac{4}{(P + Q)^2} \left( \sum_i \sum_j n_{ij} (A_{ij} - D_{ij})^2 - \frac{(P - Q)^2}{n} \right)$$
See Brown and Benedetti (1977) for details.
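To make the concordance bookkeeping concrete, the following Python sketch counts, for each cell, the observations that are concordant and discordant with it, and then forms gamma as $(P - Q)/(P + Q)$ with asymptotic variance $16 (P+Q)^{-4} \sum n_{ij} (Q A_{ij} - P D_{ij})^2$. It is a direct, unoptimized illustration with a hypothetical function name, and it assumes that at least one untied pair exists so that $P + Q$ is positive.

```python
def gamma_statistic(table):
    """Return (gamma estimate, asymptotic variance) for an ordinal two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    # conc[i][j] / disc[i][j]: counts in cells concordant / discordant with (i, j).
    conc = [[0] * n_cols for _ in range(n_rows)]
    disc = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_rows):
                for l in range(n_cols):
                    if (k > i and l > j) or (k < i and l < j):
                        conc[i][j] += table[k][l]
                    elif (k > i and l < j) or (k < i and l > j):
                        disc[i][j] += table[k][l]
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    p = sum(table[i][j] * conc[i][j] for i, j in cells)  # twice the concordances
    q = sum(table[i][j] * disc[i][j] for i, j in cells)  # twice the discordances
    gamma = (p - q) / (p + q)
    var = 16 / (p + q) ** 4 * sum(
        table[i][j] * (q * conc[i][j] - p * disc[i][j]) ** 2 for i, j in cells
    )
    return gamma, var
```

For the table with rows (10, 20) and (20, 10), $P = 200$ and $Q = 800$, so gamma is $-0.6$, which agrees with Yule’s $Q = (100 - 400)/(100 + 400)$, and the asymptotic variance is 0.03072.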
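Kendall’s tau-$b$ ($\tau_b$) is similar to gamma except that tau-$b$ uses a correction for ties. Tau-$b$ is appropriate only when both variables lie on an ordinal scale. The range of tau-$b$ is $-1 \le \tau_b \le 1$. Kendall’s tau-$b$ is estimated by
$$t_b = \frac{P - Q}{\sqrt{w_r w_c}}$$
and the asymptotic variance is
$$\mathrm{var}(t_b) = \frac{1}{w^4} \left( \sum_i \sum_j n_{ij} \left( 2 w d_{ij} + t_b v_{ij} \right)^2 - n^3 t_b^2 (w_r + w_c)^2 \right)$$
where
$$
\begin{aligned}
w &= \sqrt{w_r w_c} \\
w_r &= n^2 - \sum_i n_{i\cdot}^2 \\
w_c &= n^2 - \sum_j n_{\cdot j}^2 \\
d_{ij} &= A_{ij} - D_{ij} \\
v_{ij} &= n_{i\cdot} w_c + n_{\cdot j} w_r
\end{aligned}
$$
See Kendall (1955) for more information.
The variance under the null hypothesis that tau-$b$ equals zero is computed as
$$\mathrm{var}_0(t_b) = \frac{4}{w_r w_c} \left( \sum_i \sum_j n_{ij} (A_{ij} - D_{ij})^2 - \frac{(P - Q)^2}{n} \right)$$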
See Brown and Benedetti (1977) for details.
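Stuart’s tau-$c$ ($\tau_c$) makes an adjustment for table size in addition to a correction for ties. Tau-$c$ is appropriate only when both variables lie on an ordinal scale. The range of tau-$c$ is $-1 \le \tau_c \le 1$. Stuart’s tau-$c$ is estimated by
$$t_c = \frac{m (P - Q)}{n^2 (m - 1)}$$
and the asymptotic variance is
$$\mathrm{var}(t_c) = \frac{4 m^2}{(m - 1)^2 n^4} \left( \sum_i \sum_j n_{ij} d_{ij}^2 - \frac{(P - Q)^2}{n} \right)$$
where $m = \min(R, C)$ and $d_{ij} = A_{ij} - D_{ij}$. The variance under the null hypothesis that tau-$c$ equals zero is the same as the asymptotic variance $\mathrm{var}$,
$$\mathrm{var}_0(t_c) = \mathrm{var}(t_c)$$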
See Brown and Benedetti (1977) for details.
Somers’ $D(C|R)$ and Somers’ $D(R|C)$ are asymmetric modifications of tau-$b$. $C|R$ indicates that the row variable X is regarded as the independent variable and the column variable Y is regarded as dependent. Similarly, $R|C$ indicates that the column variable Y is regarded as the independent variable and the row variable X is regarded as dependent. Somers’ $D$ differs from tau-$b$ in that it uses a correction only for pairs that are tied on the independent variable. Somers’ $D$ is appropriate only when both variables lie on an ordinal scale. The range of Somers’ $D$ is $-1 \le D \le 1$. Somers’ $D(C|R)$ is computed as
$$D(C|R) = \frac{P - Q}{w_r}$$
and its asymptotic variance is
$$\mathrm{var}(D(C|R)) = \frac{4}{w_r^4} \sum_i \sum_j n_{ij} \left( w_r d_{ij} - (P - Q)(n - n_{i\cdot}) \right)^2$$
where $d_{ij} = A_{ij} - D_{ij}$ and
$$w_r = n^2 - \sum_i n_{i\cdot}^2$$
See Somers (1962), Goodman and Kruskal (1979), and Liebetrau (1983) for more information.
The variance under the null hypothesis that $D(C|R)$ equals zero is computed as
$$\mathrm{var}_0(D(C|R)) = \frac{4}{w_r^2} \left( \sum_i \sum_j n_{ij} (A_{ij} - D_{ij})^2 - \frac{(P - Q)^2}{n} \right)$$
See Brown and Benedetti (1977) for details.
Formulas for Somers’ $D(R|C)$ are obtained by interchanging the indices.
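Tau-$b$, tau-$c$, and the two Somers’ $D$ statistics share the numerator $P - Q$ and differ only in the denominator. The Python sketch below computes the four estimates (not their variances) from that observation. The function names are hypothetical. The denominators are assumptions of this sketch: $w_r = n^2 - \sum_i n_{i\cdot}^2$ and $w_c = n^2 - \sum_j n_{\cdot j}^2$ as the tie corrections, and $m = \min(R, C)$ for the table-size adjustment in tau-$c$.

```python
import math


def concordance_difference(table):
    """Return P - Q: twice (concordant pairs minus discordant pairs)."""
    diff = 0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            for k, other in enumerate(table):
                for l, n_kl in enumerate(other):
                    direction = (k - i) * (l - j)
                    if direction > 0:    # both indices move the same way
                        diff += n_ij * n_kl
                    elif direction < 0:  # indices move in opposite ways
                        diff -= n_ij * n_kl
    return diff


def ordinal_measures(table):
    """Estimates of tau-b, tau-c, Somers' D(C|R), and Somers' D(R|C)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    pq = concordance_difference(table)
    w_r = n * n - sum(t * t for t in row_tot)  # removes pairs tied on X
    w_c = n * n - sum(t * t for t in col_tot)  # removes pairs tied on Y
    m = min(len(row_tot), len(col_tot))
    return {
        "tau_b": pq / math.sqrt(w_r * w_c),
        "tau_c": m * pq / (n * n * (m - 1)),
        "somers_d_cr": pq / w_r,
        "somers_d_rc": pq / w_c,
    }
```

For the $2 \times 3$ table with rows (4, 2, 0) and (0, 2, 4), $P - Q = 64$, $w_r = 72$, and $w_c = 96$, which gives tau-$b$ of about 0.770, tau-$c$ of 8/9, $D(C|R)$ of 8/9, and $D(R|C)$ of 2/3.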
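The Pearson correlation coefficient ($\rho$) is computed by using the scores specified in the SCORES= option. This measure is appropriate only when both variables lie on an ordinal scale. The range of the Pearson correlation is $-1 \le \rho \le 1$. The Pearson correlation coefficient is estimated by
$$r = \frac{v}{w} = \frac{ss_{rc}}{\sqrt{ss_r \, ss_c}}$$
and its asymptotic variance is
$$\mathrm{var}(r) = \frac{1}{w^4} \sum_i \sum_j n_{ij} \left( w (R_i - \bar{R})(C_j - \bar{C}) - \frac{b_{ij} v}{2w} \right)^2$$
where $R_i$ and $C_j$ are the row and column scores and
$$
\begin{aligned}
\bar{R} &= \sum_i \sum_j n_{ij} R_i / n \\
\bar{C} &= \sum_i \sum_j n_{ij} C_j / n \\
ss_r &= \sum_i \sum_j n_{ij} (R_i - \bar{R})^2 \\
ss_c &= \sum_i \sum_j n_{ij} (C_j - \bar{C})^2 \\
ss_{rc} &= \sum_i \sum_j n_{ij} (R_i - \bar{R})(C_j - \bar{C}) \\
b_{ij} &= (R_i - \bar{R})^2 ss_c + (C_j - \bar{C})^2 ss_r \\
v &= ss_{rc} \\
w &= \sqrt{ss_r \, ss_c}
\end{aligned}
$$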
See Snedecor and Cochran (1989) for more information.
The SCORES= option in the TABLES statement determines the type of row and column scores used to compute the Pearson correlation (and other score-based statistics). The default is SCORES=TABLE. See the section Scores for details about the available score types and how they are computed.
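The variance under the null hypothesis that the correlation equals zero is computed as
$$\mathrm{var}_0(r) = \frac{\sum_i \sum_j n_{ij} (R_i - \bar{R})^2 (C_j - \bar{C})^2 - ss_{rc}^2 / n}{ss_r \, ss_c}$$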
Note that this expression for the variance is derived for multinomial sampling in a contingency table framework, and it differs from the form obtained under the assumption that both variables are continuous and normally distributed. See Brown and Benedetti (1977) for details.
PROC FREQ also provides an exact test for the Pearson correlation coefficient. You can request this test by specifying the PCORR option in the EXACT statement. See the section Exact Statistics for more information.
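The score-based correlation is an ordinary weighted correlation in which each cell contributes its count as the weight. The Python sketch below illustrates this for user-supplied row and column scores and also returns the Mantel-Haenszel chi-square computed as $(n - 1) r^2$. The function name is hypothetical. Passing the level values as scores corresponds to table scores, and both variables are assumed to take at least two distinct scores so that the denominator is nonzero.

```python
import math


def pearson_from_table(table, row_scores, col_scores):
    """Return (r, Q_MH) for a two-way table with the given row/column scores."""
    n = sum(sum(row) for row in table)
    # Count-weighted means of the row and column scores.
    r_bar = sum(sum(row) * row_scores[i] for i, row in enumerate(table)) / n
    c_bar = sum(n_ij * col_scores[j] for row in table for j, n_ij in enumerate(row)) / n
    ss_r = ss_c = ss_rc = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            dr = row_scores[i] - r_bar
            dc = col_scores[j] - c_bar
            ss_r += n_ij * dr * dr
            ss_c += n_ij * dc * dc
            ss_rc += n_ij * dr * dc
    r = ss_rc / math.sqrt(ss_r * ss_c)
    q_mh = (n - 1) * r * r  # Mantel-Haenszel chi-square, 1 degree of freedom
    return r, q_mh
```

For the table with rows (10, 20) and (20, 10) and scores 1 and 2 for both variables, the sums of squares are 15 and 15, the cross-product sum is $-5$, so $r = -1/3$ and the Mantel-Haenszel chi-square is $59/9 \approx 6.56$.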
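The Spearman correlation coefficient ($\rho_s$) is computed by using rank scores, which are defined in the section Scores. This measure is appropriate only when both variables lie on an ordinal scale. The range of the Spearman correlation is $-1 \le \rho_s \le 1$. The Spearman correlation coefficient is estimated by
$$r_s = \frac{v}{w}$$
and its asymptotic variance is
$$\mathrm{var}(r_s) = \frac{1}{n^2 w^4} \sum_i \sum_j n_{ij} (z_{ij} - \bar{z})^2$$
where $R1_i$ and $C1_j$ are the row and column rank scores and
$$
\begin{aligned}
v &= \sum_i \sum_j n_{ij} R(i) C(j) \\
w &= \frac{1}{12} \sqrt{F G} \\
F &= n^3 - \sum_i n_{i\cdot}^3 \\
G &= n^3 - \sum_j n_{\cdot j}^3 \\
R(i) &= R1_i - n/2 \\
C(j) &= C1_j - n/2 \\
\bar{z} &= \frac{1}{n} \sum_i \sum_j n_{ij} z_{ij} \\
z_{ij} &= w v_{ij} - v w_{ij} \\
v_{ij} &= n \Big( R(i) C(j) + \tfrac{1}{2} \sum_l n_{il} C(l) + \tfrac{1}{2} \sum_k n_{kj} R(k) \\
&\qquad + \sum_l \sum_{k>i} n_{kl} C(l) + \sum_k \sum_{l>j} n_{kl} R(k) \Big) \\
w_{ij} &= \frac{-n}{96 w} \left( F n_{\cdot j}^2 + G n_{i\cdot}^2 \right)
\end{aligned}
$$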
See Snedecor and Cochran (1989) for more information.
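The variance under the null hypothesis that the correlation equals zero is computed as
$$\mathrm{var}_0(r_s) = \frac{1}{n^2 w^2} \sum_i \sum_j n_{ij} (v_{ij} - \bar{v})^2$$
where
$$\bar{v} = \sum_i \sum_j n_{ij} v_{ij} / n$$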
Note that the asymptotic variance is derived for multinomial sampling in a contingency table framework, and it differs from the form obtained under the assumption that both variables are continuous and normally distributed. See Brown and Benedetti (1977) for details.
PROC FREQ also provides an exact test for the Spearman correlation coefficient. You can request this test by specifying the SCORR option in the EXACT statement. See the section Exact Statistics for more information.
When you specify the PLCORR option in the TABLES statement, PROC FREQ computes the polychoric correlation. This measure of association is based on the assumption that the ordered, categorical variables of the frequency table have an underlying bivariate normal distribution. For $2 \times 2$ tables, the polychoric correlation is also known as the tetrachoric correlation. See Drasgow (1986) for an overview of polychoric correlation. The polychoric correlation coefficient is the maximum likelihood estimate of the product-moment correlation between the normal variables, estimating thresholds from the observed table frequencies. The range of the polychoric correlation is from –1 to 1. Olsson (1979) gives the likelihood equations and an asymptotic covariance matrix for the estimates.
To estimate the polychoric correlation, PROC FREQ iteratively solves the likelihood equations by a Newton-Raphson algorithm that uses the Pearson correlation coefficient as the initial approximation. Iteration stops when the convergence measure falls below the convergence criterion or when the maximum number of iterations is reached, whichever occurs first. The CONVERGE= option sets the convergence criterion, and the default value is 0.0001. The MAXITER= option sets the maximum number of iterations, and the default value is 20.
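Asymmetric lambda, $\lambda(C|R)$, is interpreted as the probable improvement in predicting the column variable Y given knowledge of the row variable X. The range of asymmetric lambda is $0 \le \lambda(C|R) \le 1$. Asymmetric lambda ($C|R$) is computed as
$$\lambda(C|R) = \frac{\sum_i r_i - r}{n - r}$$
and its asymptotic variance is
$$\mathrm{var}(\lambda(C|R)) = \frac{n - \sum_i r_i}{(n - r)^3} \left( \sum_i r_i + r - 2 \sum_i (r_i \mid l_i = l) \right)$$
where
$$
\begin{aligned}
r_i &= \max_j (n_{ij}) \\
r &= \max_j (n_{\cdot j}) \\
c_j &= \max_i (n_{ij}) \\
c &= \max_i (n_{i\cdot})
\end{aligned}
$$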
The values of $l_i$ and $l$ are determined as follows. Denote by $l_i$ the unique value of $j$ such that $r_i = n_{ij}$, and let $l$ be the unique value of $j$ such that $r = n_{\cdot j}$. Because of the uniqueness assumptions, ties in the frequencies or in the marginal totals must be broken in an arbitrary but consistent manner. In case of ties, $l$ is defined as the smallest value of $j$ such that $r = n_{\cdot j}$.
For those columns containing a cell $(i, j)$ for which $n_{ij} = r_i = c_j$, $cs_j$ records the row in which $c_j$ is assumed to occur. Initially $cs_j$ is set equal to –1 for all $j$. Beginning with $i = 1$, if there is at least one value $j$ such that $n_{ij} = r_i = c_j$, and if $cs_j = -1$, then $l_i$ is defined to be the smallest such value of $j$, and $cs_j$ is set equal to $i$. Otherwise, if $n_{il} = r_i$, then $l_i$ is defined to be equal to $l$. If neither condition is true, then $l_i$ is taken to be the smallest value of $j$ such that $n_{ij} = r_i$.
The formulas for lambda asymmetric $(R|C)$ can be obtained by interchanging the indices.
See Goodman and Kruskal (1979) for more information.
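The point estimate of asymmetric lambda needs only the largest count in each row and the largest column total. The Python sketch below covers the estimate alone and leaves out the tie-breaking rules that the variance requires. The function name is hypothetical. The sketch assumes the estimate is (sum of row maxima minus the largest column total) divided by ($n$ minus the largest column total), and that the observations are not all in a single column, so the denominator is positive.

```python
def lambda_c_given_r(table):
    """Estimate of asymmetric lambda for predicting the column from the row."""
    n = sum(sum(row) for row in table)
    sum_row_max = sum(max(row) for row in table)           # best guesses knowing X
    largest_col = max(sum(col) for col in zip(*table))     # best guess ignoring X
    return (sum_row_max - largest_col) / (n - largest_col)
```

For the table with rows (10, 20) and (20, 10), the row maxima sum to 40 and the largest column total is 30, so lambda is $(40 - 30)/(60 - 30) = 1/3$. For rows (10, 5) and (8, 2), the same column is the best prediction in every row, and lambda is 0.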