The FREQ Procedure

Definitions and Notation

A two-way table represents the crosstabulation of row variable X and column variable Y. Let the table row values or levels be denoted by $X_i$, $i = 1, 2, \ldots, R$, and the column values by $Y_j$, $j = 1, 2, \ldots, C$. Let $n_{ij}$ denote the frequency of the table cell in the $i$th row and $j$th column, and define the following notation:

$\begin{aligned} n_{i \cdot } & = \sum _ j n_{ij} & & \mbox{(row totals)} \\ n_{\cdot j} & = \sum _ i n_{ij} & & \mbox{(column totals)} \\ n & = \sum _ i \sum _ j n_{ij} & & \mbox{(overall total)} \\ p_{ij} & = n_{ij} / n & & \mbox{(cell proportions)} \\ p_{i \cdot } & = n_{i \cdot } / n & & \mbox{(row proportions of total)} \\ p_{\cdot j} & = n_{\cdot j} / n & & \mbox{(column proportions of total)} \end{aligned}$
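As an illustration of the notation above, the following minimal Python sketch computes the marginal totals and the quantities $p_{ij}$, $p_{i \cdot}$, and $p_{\cdot j}$ for a small table. The 2x3 table of frequencies and all variable names are hypothetical, chosen only for this example; this is not PROC FREQ code.

```python
# Hypothetical 2x3 table of cell frequencies n_ij
# (rows are levels of X, columns are levels of Y).
table = [
    [10, 20, 30],
    [20, 10, 10],
]

row_totals = [sum(row) for row in table]          # n_i.
col_totals = [sum(col) for col in zip(*table)]    # n_.j
n = sum(row_totals)                               # overall total n

cell_props = [[n_ij / n for n_ij in row] for row in table]  # p_ij = n_ij / n
row_props = [t / n for t in row_totals]                     # p_i. = n_i. / n
col_props = [t / n for t in col_totals]                     # p_.j = n_.j / n

print(row_totals, col_totals, n)  # [60, 40] [30, 30, 40] 100
```

Both sets of marginal values sum to 1, which is a quick check that the totals were accumulated over the correct index.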

$\begin{aligned} R_{i} & = \mbox{score for row } i \\ C_{j} & = \mbox{score for column } j \end{aligned}$

$\begin{aligned} \bar{R} & = \sum _ i n_{i \cdot } R_{i} / n & & \mbox{(average row score)} \\ \bar{C} & = \sum _ j n_{\cdot j} C_{j} / n & & \mbox{(average column score)} \end{aligned}$
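The average scores are frequency-weighted means of the row and column scores. The sketch below shows the computation in Python for a hypothetical table, using the scores 1, 2, ... for rows and columns purely as an example; the function name `mean_scores` is an assumption of this sketch, not part of PROC FREQ.

```python
def mean_scores(table, row_scores, col_scores):
    """Return (R_bar, C_bar): the average row and column scores,
    each weighted by the corresponding marginal totals."""
    row_totals = [sum(row) for row in table]        # n_i.
    col_totals = [sum(col) for col in zip(*table)]  # n_.j
    n = sum(row_totals)
    r_bar = sum(t * s for t, s in zip(row_totals, row_scores)) / n
    c_bar = sum(t * s for t, s in zip(col_totals, col_scores)) / n
    return r_bar, c_bar


# Hypothetical data: 2x3 table with example scores 1..R and 1..C.
table = [[10, 20, 30], [20, 10, 10]]
r_bar, c_bar = mean_scores(table, row_scores=[1, 2], col_scores=[1, 2, 3])
# r_bar = (60*1 + 40*2) / 100 = 1.4
# c_bar = (30*1 + 30*2 + 40*3) / 100 = 2.1
```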

$\begin{aligned} A_{ij} & = \sum _{k>i} ~ \sum _{l>j} n_{kl} + \sum _{k<i} ~ \sum _{l<j} n_{kl} \\ D_{ij} & = \sum _{k>i} ~ \sum _{l<j} n_{kl} + \sum _{k<i} ~ \sum _{l>j} n_{kl} \\ P & = \sum _ i \sum _ j n_{ij} A_{ij} \quad \mbox{(twice the number of concordances)} \\ Q & = \sum _ i \sum _ j n_{ij} D_{ij} \quad \mbox{(twice the number of discordances)} \end{aligned}$
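The definitions of $A_{ij}$, $D_{ij}$, $P$, and $Q$ can be checked on a small table with a direct, unoptimized translation into Python. The function name `concordance_totals` and the example table are assumptions of this sketch; the loops mirror the sums above rather than any particular PROC FREQ implementation.

```python
def concordance_totals(table):
    """Return (P, Q): twice the number of concordant and discordant pairs."""
    n_rows, n_cols = len(table), len(table[0])

    def block_sum(rows, cols):
        # Sum of n_kl over the given row and column index ranges.
        return sum(table[k][l] for k in rows for l in cols)

    P = Q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            below, above = range(i + 1, n_rows), range(i)   # k > i, k < i
            right, left = range(j + 1, n_cols), range(j)    # l > j, l < j
            A_ij = block_sum(below, right) + block_sum(above, left)
            D_ij = block_sum(below, left) + block_sum(above, right)
            P += table[i][j] * A_ij
            Q += table[i][j] * D_ij
    return P, Q


# Hypothetical 2x3 table.
P, Q = concordance_totals([[10, 20, 30], [20, 10, 10]])
print(P, Q)  # 800 2600
```

Each concordant pair is counted once from each of its two cells, which is why $P$ equals twice the number of concordances (here 400) and likewise $Q$ equals twice the number of discordances (here 1300).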

Scores

PROC FREQ uses scores of the variable values to compute the Mantel-Haenszel chi-square, Pearson correlation, Cochran-Armitage test for trend, weighted kappa coefficient, and Cochran-Mantel-Haenszel statistics. The SCORES= option in the TABLES statement specifies the score type that PROC FREQ uses. The available score types are TABLE, RANK, RIDIT, and MODRIDIT scores. The default score type is TABLE. Using MODRIDIT, RANK, or RIDIT scores yields nonparametric analyses.

For numeric variables, table scores are the values of the row and column levels. If the row or column variable is formatted, then the table score is the internal numeric value corresponding to that level. If two or more numeric values are classified into the same formatted level, then the internal numeric value for that level is the smallest of these values. For character variables, table scores are defined as the row numbers and column numbers (that is, 1 for the first row, 2 for the second row, and so on).

Rank scores, which you request with the SCORES=RANK option, are defined as

$\begin{aligned} R^1_ i & = \sum _{k<i} n_{k \cdot } + (n_{i \cdot } + 1) / 2 \quad & & i = 1, 2, \ldots , R \\ C^1_ j & = \sum _{l<j} n_{\cdot l} + (n_{\cdot j} + 1) / 2 \quad & & j = 1, 2, \ldots , C \end{aligned}$

where $R^1_i$ is the rank score of row $i$, and $C^1_j$ is the rank score of column $j$. Note that rank scores yield midranks for tied values.
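The rank score formula can be sketched in a few lines of Python: each level receives the count of all observations in earlier levels plus the midpoint of its own block of tied observations. The function name `rank_scores` and the marginal totals used here are hypothetical.

```python
def rank_scores(totals):
    """Rank (midrank) scores from a list of marginal totals.

    Level i gets sum_{k<i} totals[k] + (totals[i] + 1) / 2.
    """
    scores, cumulative = [], 0
    for t in totals:
        scores.append(cumulative + (t + 1) / 2)
        cumulative += t
    return scores


# Hypothetical marginal totals for a table with n = 100.
row_rank = rank_scores([60, 40])      # [30.5, 80.5]
col_rank = rank_scores([30, 30, 40])  # [15.5, 45.5, 80.5]
```

For example, the 60 observations in the first row occupy ranks 1 through 60, and their midrank is 30.5.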

Ridit scores, which you request with the SCORES=RIDIT option, are defined as rank scores standardized by the sample size (Bross, 1958; Mack and Skillings, 1980). Ridit scores are derived from the rank scores as

$\begin{aligned} R^2_ i & = R^1_ i / n \quad & & i = 1, 2, \ldots , R \\ C^2_ j & = C^1_ j / n \quad & & j = 1, 2, \ldots , C \end{aligned}$

Modified ridit scores (SCORES=MODRIDIT) represent the expected values of the order statistics of the uniform distribution on (0,1) (van Elteren, 1960; Lehmann and D’Abrera, 2006). Modified ridit scores are derived from rank scores as

$\begin{aligned} R^3_ i & = R^1_ i / (n + 1) \quad & & i = 1, 2, \ldots , R \\ C^3_ j & = C^1_ j / (n + 1) \quad & & j = 1, 2, \ldots , C \end{aligned}$
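Because ridit and modified ridit scores are simple rescalings of the rank scores, a short Python sketch can show all three side by side. The helper names and the marginal totals below are assumptions made for illustration only.

```python
def rank_scores(totals):
    """Midrank scores: earlier-level counts plus (own count + 1) / 2."""
    scores, cumulative = [], 0
    for t in totals:
        scores.append(cumulative + (t + 1) / 2)
        cumulative += t
    return scores


def ridit_scores(totals):
    """Rank scores divided by the sample size n."""
    n = sum(totals)
    return [r / n for r in rank_scores(totals)]


def modridit_scores(totals):
    """Rank scores divided by n + 1."""
    n = sum(totals)
    return [r / (n + 1) for r in rank_scores(totals)]


# Hypothetical row totals with n = 100.
row_totals = [60, 40]
ridit = ridit_scores(row_totals)        # 30.5/100 and 80.5/100
modridit = modridit_scores(row_totals)  # 30.5/101 and 80.5/101

# Frequency-weighted mean of the modified ridit scores.
n = sum(row_totals)
mean_modridit = sum(t * s for t, s in zip(row_totals, modridit)) / n
```

Since the midranks sum to $n(n+1)/2$ over all observations, the frequency-weighted mean of the modified ridit scores is 0.5, while the corresponding mean of the ridit scores is $(n+1)/(2n)$.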