
Sample 25006: Compute estimates and tests of agreement among multiple raters


Contents: Purpose / History / Requirements / Usage / Details / Limitations / Missing Values / See Also / References

 

NOTE: For assessing agreement between two raters, use the AGREE option in PROC FREQ.
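For example, with two raters whose ratings of each item are stored in separate variables (a minimal sketch; the data set and variable names are hypothetical):

   proc freq data=TwoRaters;
      tables rater1*rater2 / agree;   /* simple and weighted kappa for two raters */
   run;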

 

PURPOSE:
Compute estimates and tests of agreement among multiple raters when responses (ratings) are on a nominal or ordinal scale. For a nominal response, kappa and Gwet's AC1 agreement coefficient are available. For an ordinal response, Gwet's weighted agreement coefficient (AC2) and statistics based on a cumulative probit model with random effects for raters and items are available. If the response is numerically coded (and possibly continuous), Kendall's coefficient of concordance is also available.
HISTORY:
The version of the MAGREE macro that you are using is displayed when you specify anything as the first macro argument. For example:
    %magree(v)

The MAGREE macro always attempts to check for a later version of itself. If it cannot do so (such as when no active internet connection is available), the macro issues the following message:

    NOTE: Unable to check for newer version of MAGREE macro.

The computations performed by the macro are not affected by the appearance of this message. You can avoid the check by specifying nochk as the first macro argument, which is useful on machines without internet access.
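For example, the following call skips the version check (a sketch; the data set and variable names are hypothetical, and the keyword parameters are described under USAGE below):

    %magree(nochk, data=ratings, items=item, raters=rater, response=score)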

Version  Update Notes
3.91     Added check for nonpositive definite matrix for GLMM statistics. Fixed sign on item and rater effects. nochk can be specified as the first (version) parameter.
3.9      Improved error checking and messaging.
3.8      Added check for zero variance in Gwet statistics.
3.7      Added nominal and ordinal as valid values in stat=.
3.6      Added GLMM-based κm and κma (Nelson and Edwards, 2015, 2018). Changed stat= to accept a list. Added itemeffects and ratereffects options.
3.3      Added wcl option.
3.2      Added percategory option. Category kappas are no longer displayed by default.
3.1      Suppressed display of fixed-items and unconditional rows in Gwet results with fewer than three raters. Added (no)fixrateronly option.
3.0      Added Gwet's agreement coefficients. Added options noprint, noweights, and counts. Former nobaltab option changed to nosummary.
2.1      Fixed a minor problem when the name in items=, raters=, or response= differs in case from the name in the data.
2.0      Confidence intervals are provided for kappa (Vanbelle, 2018). Added alpha= option. Allowed computation of kappa and confidence intervals (but not tests) with an unequal number of ratings per item. Added optional table of rating counts across items.
1.3      Fixed errors caused by using the name COUNT or PERCENT for the items=, raters=, or response= variable.
1.2      Corrected cases of perfect agreement. Required more than one item and rater. Added check for new version; reset _last_ data set and notes option at end.
1.1      Added error checking for parameters, data set, and variables. Added version option.
REQUIREMENTS:
Only Base SAS® is required for kappa and Gwet's statistics. SAS/STAT® is required for the Kendall and GLMM-based statistics. SAS/IML® is required for GLMM-based statistics.
USAGE:
Follow the instructions in the Downloads tab of this sample to save the MAGREE macro definition. Replace the text within quotation marks in the following statement with the location of the MAGREE macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the MAGREE macro and make it available for use:
   %inc "<location of your file containing the MAGREE macro>";

Following this statement, you may call the MAGREE macro. See the Results tab for an example.
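As an additional minimal sketch here (the file location, data set, and variable names are hypothetical), the following creates a data set with one observation per rating and requests all applicable statistics:

   %inc "C:\mymacros\magree.sas";

   data ratings;
      input item rater score;   /* one observation per rating */
      datalines;
   1 1 2
   1 2 2
   1 3 3
   2 1 1
   2 2 1
   2 3 2
   3 1 3
   3 2 3
   3 3 3
   ;

   %magree(data=ratings, items=item, raters=rater, response=score)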

The following parameters are required:

items=variable
Specifies the variable containing item (or subject) identifiers. This variable may be numeric or character. There must be more than one item.
raters=variable
Specifies the variable containing rater identifiers. This variable may be numeric or character. The same rater values must be used for all items even if items are rated by different raters. Note that Kendall's statistic is only valid when all items are rated by the same raters. There must be more than one rater.
response=variable
Specifies the variable containing the ratings. This variable may be numeric or character. If it is character, it is assumed to be nominal and only kappa and Gwet's statistics are computed. If it is numeric, any of the statistics can be computed, but note that the Kendall and GLMM-based statistics are only valid for ordinal responses. Any format assigned to the response variable is ignored.

The following parameters are optional:

data=data-set-name
Identifies the input data set to be analyzed. If not specified, the last-created data set is used. It should contain one observation per rating – that is, each observation should contain one rating on one item by one rater with variables containing item and rater identifiers and a variable containing the rating (response). The same set of rater values must be used for each item, even if items are rated by different raters (which is only valid for kappa and Gwet statistics).
stat=all | nominal | ordinal | statistics
Specifies the statistics to calculate. Specify nominal to request all relevant statistics for a nominal-scale response, specify ordinal to request all relevant statistics for an ordinal-scale response, or specify one or more particular statistics separated by spaces. The statistics that you can specify include the following. Specify kappa to compute Fleiss' kappa statistics for a nominal response. Specify gwet to compute Gwet's AC1 statistic for a nominal response, or Gwet's AC2 statistic for an ordinal response if weight= is also specified. Specify kendall to compute Kendall's coefficient of concordance for an ordinal or continuous response. Specify glmm to compute the GLMM-based κm and κma statistics for an ordinal response. Note that the Kendall and GLMM-based statistics are not computed if the response= variable is a character variable. Use all to compute all statistics (if possible). By default, stat=all.
alpha=value
Specifies the alpha level for confidence intervals with confidence level equal to 1-value. Value must be between 0 and 1. The default is alpha=0.05.
weight=linear | quadratic | sqrt | id | value | data-set-name
Specifies the structure of a symmetric weight matrix to be computed for use in Gwet's AC2 statistic and the GLMM-based κma statistic for ordinal responses. The Kendall and kappa statistics are not computed when weight= is specified. See Weights below for details. Value, if specified, must be between 0.01 and 5. Specifying weight=linear is equivalent to weight=1, weight=quadratic is equivalent to weight=2, and weight=sqrt is equivalent to weight=0.5. When weight=data-set-name, the specified data set must have exactly k observations and k numeric variables containing the weight matrix, where k is the number of response categories. Any names can be used for the variables in the data set. weight= is ignored unless gwet, glmm, or all is specified in stat=. If weight= is omitted, Gwet's AC1 is computed when the Gwet statistic is requested, and κm is computed when the GLMM-based statistics are requested.
wtparm=value
Specifies the parameter of the exponential distribution that can optionally be used in the computation of a symmetric weight matrix for Gwet's AC2 statistic or the GLMM-based κma statistic for ordinal responses. See Weights below for details. If specified, value must be greater than or equal to 0.01. wtparm= is ignored unless weight= is specified and stat= includes gwet, glmm, or all. It is also ignored if a data set name is specified in weight=.
options=options
One or more of the following options can be specified, separated by spaces. A combined example call appears after this list.
counts
Displays a table showing the number of ratings in each category for each item.
noprint
Suppresses all displayed results. See Output data sets below.
nosummary
Suppresses display of the table showing the number of items that have specific numbers of ratings.
noweights
Suppresses display of the weight matrix, if specified.
t
Tests and confidence limits for Gwet's statistics and confidence limits for Kendall's W are computed using the t distribution. By default, or with option z, large-sample values are computed using the standard normal distribution.
z
Tests and confidence limits for Gwet's statistics and confidence limits for Kendall's W are computed using the standard normal distribution. This is the default.
fixrateronly
Suppresses computation of the fixed-items and unconditional variances and related Gwet statistics. This avoids the iterative jackknife method that becomes more time consuming as the number of items increases.
percategory
For Fleiss' kappa and Gwet's agreement coefficients, provides statistics for assessing agreement on each response category. Category-level agreement does not apply to the Kendall and GLMM-based statistics.
wcl
Provides a jackknife-based confidence interval for Kendall's coefficient of concordance, W.
itemeffects
When GLMM-based statistics are requested, displays estimated random effects for each item.
ratereffects
When GLMM-based statistics are requested, displays estimated random effects for each rater.
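For example, the following sketch (hypothetical data set and variable names) requests per-category statistics, the table of rating counts, and t-based tests and confidence limits:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=kappa gwet, options=percategory counts t)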
DETAILS:
The methodology implemented by the MAGREE macro is presented in Vanbelle (2018), Fleiss (1981, 2003), Gwet (2008; agreestat.com), Fleiss et al. (1979), Kendall (1955), and Nelson and Edwards (2015, 2018). See these references for details. For the kappa and Gwet statistics, the MAGREE macro assumes that the response rating scale is categorical and that each possible value of the scale is observed at least once. That is, if the raters rate the items on a scale with k levels, but only k-p levels appear in the data, then these agreement statistics assess rater agreement on a scale with only k-p levels.

Fleiss' kappa

Fleiss' kappa statistic is provided for the multirater case. It is available for nominal responses (or ordinal responses to be treated as nominal), which may have numeric or character values. Raters may rate each item no more than once. If all items are rated by all raters exactly once, estimates of kappa (overall, and for each response category when the percategory option is specified) are provided along with tests and confidence intervals. If some raters do not rate some items, kappa estimates and confidence intervals (Vanbelle, 2018) are still available, but tests cannot be provided. As noted by Fleiss et al. (1979) concerning the multiple-rater kappa statistic, the raters for a given item "are not necessarily the same as those rating another."

Note that Fleiss' statistic does not reduce to Cohen's kappa statistic (as provided by the AGREE option in PROC FREQ) for the case of only two raters. As described by Gwet (2018), this is because the Fleiss statistic is really an extension of Scott's π statistic rather than of Cohen's kappa.

Landis and Koch (1977) attempted to qualify the degree of agreement that exists when kappa falls in various ranges. However, the practical meaning of any given value of kappa is likely to depend on the area of study. Note that the maximum value of kappa is one, indicating complete agreement. Kappa can be negative when agreement is less than expected by chance.

Kappa value   Degree of agreement
≤ 0           Poor
0 - 0.2       Slight
0.2 - 0.4     Fair
0.4 - 0.6     Moderate
0.6 - 0.8     Substantial
0.8 - 1       Almost perfect

Unlike the AGREE option in PROC FREQ for the two-rater case, this macro does not provide a weighted kappa for the multirater case.

Gwet's AC1 and AC2

Gwet's agreement coefficients, AC1 and AC2, are available for nominal (unordered) and ordinal responses. For a nominal response, AC1 can be used to estimate agreement among the raters. When the response is ordinal and a weight matrix is provided (see Weights below), interrater agreement is estimated by the AC2 statistic. The maximum value of these agreement coefficients is one, indicating perfect agreement. The value can be negative when agreement is less than expected by chance.

Gwet developed three different standard error estimates and these are provided along with a test and confidence interval for each. The first is the standard error when raters are considered fixed and items are sampled. This is useful when you are interested in estimating or testing agreement among just the raters that were observed. The second is the standard error when items are considered fixed and raters are sampled. Interest is then in rater agreement involving only the observed items. Finally, the third standard error is fully unconditional and treats both raters and items as sampled. This allows agreement to be assessed over the entire populations of raters and items. By default, a large-sample test and confidence interval are provided. If the percategory option is specified, these statistics are also provided for each response category.

Unlike Fleiss' kappa, tests and confidence intervals based on all three standard errors are available whether all items are rated by all raters or not. Items can have unequal numbers of ratings. However, the fixed-items and unconditional standard errors require at least three raters.

When there are only two raters, the AC1 statistic is the same as that computed by the AGREE(AC1) option in PROC FREQ in SAS 9.4 TS1M4 or later. However, the standard error estimates differ because different estimation methods are used. Gwet (2008) provides the estimator for the fixed-rater variance. A jackknife approach is used to estimate the fixed-items variance. The unconditional variance is the sum of the two conditional variances.

GLMM-based κm and κma

Nelson and Edwards' (2015, 2018) statistics, based on a generalized linear mixed model (GLMM), are available for ordinal response data. SAS/STAT® and SAS/IML® are required. Raters are not required to rate all items. Nelson and Edwards note that, unlike some other measures, their statistics are invariant to the prevalence of the condition measured by the ordinal response. The measures, and inferences based on them, apply to the rater and item populations from which the observed raters and items are randomly selected. Raters are assumed to rate each item independently. A cumulative probit GLMM with random effects for raters and items is fitted assuming independent and normally distributed random effects. The measures are defined using the rater and item variance estimates from the model. Exact agreement among the raters is measured by κm. Partial agreement or association, in which rater differences of varying size contribute partially to the measure as defined by a weight matrix (see Weights below), is assessed by the κma statistic. Both statistics range from zero to one, with values near zero indicating low agreement and values near one indicating strong agreement (after correcting for chance).

The time needed to fit the GLMM increases with the numbers of raters and items and might become infeasible when either is large.

The macro provides estimates of both measures along with confidence limits. With this model-based approach, estimates of the random effects for raters and items (ratereffects and itemeffects options) are available which you can use to investigate differences among the individual raters or items. A large positive random effect for a rater indicates that the rater tends to rate items higher than other raters. A large negative effect indicates a tendency to rate items lower than other raters. Similarly for items, a large effect (positive or negative) suggests that the item is rated more toward one end (higher or lower) of the response scale.

Also displayed is an estimate of ρ (rho), which essentially measures the proportion of the total variability in the ratings that is attributable to the items and ranges between 0 and 1. A value of ρ near one indicates that most of the variability is due to item variability. High item variability suggests that the items are relatively distinct and easier for raters to distinguish. Nelson and Edwards suggest that other agreement measures might be liberal (overly large) in this situation. A value of ρ near zero indicates that most of the variability is among the raters rather than the items. When the variability among the items is relatively small, the items might not be as distinct, making it difficult for raters to distinguish them.

See the cited papers for details and the results of simulation studies exploring the performance of these statistics.
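The following sketch (hypothetical data set and variable names) requests the GLMM-based statistics along with the estimated random effects for items and raters:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=glmm, options=itemeffects ratereffects)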

Weights

You can specify weight= to request use of a computed weight matrix or to specify your own matrix to be used in the computation of Gwet's AC2 and/or the GLMM-based κma.

The values in the weight matrix can be thought of as the amount of agreement for differences of various sizes among the ordered response categories. Typically, a difference of one category (such as a rating of 1 vs 2) is considered less serious (or more in agreement) than a difference of more than one category (such as 1 vs 5). Consequently, values close to the main diagonal of the weight matrix (where 1 represents complete agreement) are typically large and diminish to zero at the far, off-diagonal corners. The weight matrix must be square, with the number of rows and columns equal to the number of response categories; it should be symmetric, must have all diagonal elements equal to 1, and must contain no negative values. The unweighted AC1 statistic, used when treating the response categories as nominal, can be thought of as considering all differences of any size as equally serious.

Predefined weight matrices – linear, quadratic, square root, and identity – are available, or you can provide your own. With the identity matrix, the AC2 results are the same as the unweighted AC1 results. Alternatively, you can provide a power value for defining those matrices as well as matrices based on a wider range of powers. For example, weight=1.5 defines a weight matrix between linear and quadratic. When a weight matrix is specified and the response is character valued, the character values in sorted order are assigned the equally spaced numeric values 1, 2, 3, and so on. For κma, the values 1, 2, 3, ... are always used whether the response is character or numeric. When the response variable is numeric, the response variable values are used in the computation of the weights for the AC2 statistic. This allows the weights to reflect categories that are not equally spaced. To treat the categories as equally spaced, assign equally spaced numeric values to the categories, or assign character values to the categories. For categories with values j and k, the weights are computed as 1 - [|j-k|/(max-min)]^p, where max and min are the maximum and minimum category values and p is the numeric value specified in weight=.
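As an illustration of this formula, the following DATA step sketch builds the k-observation, k-variable data set that weight= accepts, here for five equally spaced categories coded 1 through 5 (the data set name wmat and the choice p=1 for linear weights are hypothetical):

   data wmat(keep=w1-w5);
      array w{5};
      p = 1;                            /* power: 1=linear, 2=quadratic, 0.5=sqrt */
      do i = 1 to 5;
         do j = 1 to 5;                 /* 1 - [|i-j|/(max-min)]**p with max-min = 4 */
            w{j} = 1 - (abs(i-j)/4)**p;
         end;
         output;                        /* one row of the weight matrix per response category */
      end;
   run;

The resulting data set could then be supplied to the macro as weight=wmat.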

To provide more control over the distribution of weights across the category pairs, the weights can be computed based on quantiles of an exponential distribution. When you specify the parameter of the exponential distribution in wtparm= (weight= must also be specified), the values from the above weight formula are used as probabilities for computing quantiles from an exponential distribution with the specified parameter. These quantiles become the weights. Increasing values of the parameter, usually in the range 0.1 to 2, tend to reduce the weights, particularly toward the off-diagonal corners of the weight matrix.

When Gwet statistics are requested and the percategory option is specified to assess agreement on each category, the response is dichotomized such that all categories other than the one in question are combined. As a result, ordinality no longer applies and weights are not used. The percategory option is not available with GLMM-based statistics.

Kendall's W

Kendall's coefficient of concordance, W, is a nonparametric statistic related to Friedman's test and tests agreement of the raters' rankings of the items. It is available when the response is ordinal and numerically coded. SAS/STAT® is required. Because the macro ranks the raters' values, the response can be continuous. Mid-ranks are used when ties occur. All raters must rate all items exactly once. A large-sample test is provided by default. The jackknife-based estimate of the standard deviation of W (Grothe and Schmid, 2011), and confidence limits using this estimate, are provided with the wcl option. Grothe and Schmid (2011) discuss the coverage probabilities of this interval based on their simulation results. They indicate that the estimate improves with an increasing number of items and is useful when the number of items exceeds 10 or 20. Because the response may be continuous and not have a fixed set of discrete levels, agreement for individual categories is not possible.
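A sketch (hypothetical data set and variable names) requesting Kendall's W with the jackknife-based confidence interval:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=kendall, options=wcl)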

BY group processing

While the MAGREE macro does not directly support BY group processing, this capability can be provided by the RunBY macro which can run the MAGREE macro repeatedly for each of the BY groups in your data. See the RunBY macro documentation for details on its use. Also see the example titled "BY group processing" in the Results tab above.

Output data sets

Values of all statistics, their standard errors, confidence intervals, and p-values are available in data sets after execution of the macro. The kappa estimates and related statistics are in the data set _KAPPAS. Kendall's coefficient and related statistics are in the data set _CONCORD. Gwet's agreement coefficients and related statistics are saved in the data set _GAC. The GLMM-based statistics are saved in the data set _GLMM. If the ratereffects option is specified, the results are in the data set _GLMMVI. If the itemeffects option is specified, the results are in the data set _GLMMUI.
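For example, the following sketch (hypothetical data set and variable names) suppresses all displayed output and then prints the saved Gwet statistics:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=gwet, options=noprint)

   proc print data=_gac;
   run;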

LIMITATIONS:
There must be more than one rater and more than one item to compute any of the statistics. Also, no statistics are computed if any rater rates any item more than once. The set of values in the raters= variable must be the same for all items, even if different raters are used for different items (which is only valid for the kappa and Gwet statistics). See MISSING VALUES below. Kappa, Gwet, and GLMM-based statistics require that each item have zero or one rating from each rater. Items may have unequal numbers of ratings, but in that case only confidence intervals, not tests, are available for kappa. However, in the case of unequal numbers of ratings per item, both confidence intervals and tests are available for Gwet's statistics. Kendall's statistic requires that the response variable be numeric and ordinal and that all items have exactly one rating from all raters.
MISSING VALUES:
Observations with missing values in the response=, items=, or raters= variable are removed before the analysis. Error messages are issued and no statistics are computed if any rater rates an item more than once. If some raters do not rate some items, the Kendall statistic and the tests for the kappa statistics are not computed. However, the GLMM-based statistics and confidence intervals for Fleiss' kappa are provided in this case. While Gwet's statistics are provided even if some raters do not rate some items, at least three raters are required to estimate the fixed-items and unconditional standard errors.
SEE ALSO:
The FREQ procedure in Base SAS software can compute simple and weighted kappa statistics for comparing two raters. Beginning in SAS 9.4 TS1M4, it can also provide Gwet's AC1 statistic for the two-rater case. Use the AGREE or AGREE(AC1) option in the TABLES statement.
REFERENCES:
Fleiss, J.L., Levin, B., and Paik, M.C. (2003), Statistical Methods for Rates and Proportions, 3rd ed., Hoboken, NJ: John Wiley & Sons, Inc.

Fleiss, J.L. (1981), Statistical Methods for Rates and Proportions, 2nd ed., New York: John Wiley & Sons, Inc.

Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979), "Large Sample Variance of Kappa in the Case of Different Sets of Raters," Psychological Bulletin, 86(5), 974-977.

Grothe, O. and Schmid, F. (2011), "Kendall's W Reconsidered," Communications in Statistics – Simulation and Computation, 40, 285-305.

Gwet, K.L. (2008), "Computing inter-rater reliability and its variance in the presence of high agreement," British Journal of Mathematical and Statistical Psychology, 61, 29-48.

Kendall, M.G. (1955), Rank Correlation Methods, 2nd ed., London: Charles Griffin & Co. Ltd.

Landis, J.R. and Koch, G.G. (1977), "The measurement of observer agreement for categorical data," Biometrics, 33, 159-174.

Nelson, K.P. and Edwards, D. (2015), "Measures of agreement between many raters for ordinal classifications," Statistics in Medicine, 34(23), 3116-3132.

Nelson, K.P. and Edwards, D. (2018), "A measure of association for ordered categorical data in population-based studies," Statistical Methods in Medical Research, 27(3), 812-831.

Vanbelle, S. (2018), "Asymptotic variability of (multilevel) multirater kappa coefficients," Statistical Methods in Medical Research.




These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.