NOTE: For assessing agreement between two raters, use the AGREE option in PROC FREQ.
To display the version of the MAGREE macro you are using, specify a value such as v for the first (version) argument:
%magree(v)
The MAGREE macro always attempts to check for a later version of itself. If it is unable to do this (such as if there is no active Internet connection available), the macro will issue the following message:
NOTE: Unable to check for newer version of MAGREE macro.
The computations performed by the macro are not affected by the appearance of this message. However, the check can be skipped entirely by specifying nochk as the first macro argument, which is useful on a machine with no Internet connection.
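For example, a call that skips the version check might look like the following sketch. The data= parameter name and the data set and variable names are assumptions used only for illustration; items=, raters=, and response= name the item, rater, and rating variables.

/* A minimal sketch: nochk as the first (version) argument suppresses the online
   version check.  The data= parameter name and all data set and variable names
   here are assumptions for illustration. */
%magree(nochk, data=ratings, items=subject, raters=rater, response=score)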
Version | Update Notes |
3.91 | Added check for nonpositive definite matrix for GLMM statistics. Fixed sign on item, rater effects. nochk can be specified as the first (version) parameter. |
3.9 | Improved error checking and messaging. |
3.8 | Added check for zero variance in Gwet statistics. |
3.7 | Added nominal and ordinal as valid values in stat=. |
3.6 | Added GLMM-based κm and κma (Nelson and Edwards, 2015, 2018). Changed stat= to accept a list. Added itemeffect and ratereffect options. |
3.3 | Added wcl option. |
3.2 | Added percategory option. Category kappas no longer displayed by default. |
3.1 | Suppress display of fixed-items and unconditional rows in Gwet results with fewer than three raters. Added (no)fixrateronly option. |
3.0 | Added Gwet's agreement coefficients. Added options noprint, noweights, and counts. Former nobaltab option changed to nosummary. |
2.1 | Fixed minor problem when name in items=, raters=, or response= differs in case from name in data. |
2.0 | Confidence intervals are provided for kappa (Vanbelle, 2018). Added alpha= option. Allow computation of kappa and confidence intervals (but not tests) with unequal number of ratings per item. Added optional table of rating counts across items. |
1.3 | Fixed errors caused by using the name COUNT or PERCENT for the items=, raters=, or response= variable. |
1.2 | Corrected cases of perfect agreement. Require more than one item and rater. Added check for new version, reset of _last_ data set and notes option at end. |
1.1 | Added error checking for parameters, data set, and variables. Added version option. |
%inc "<location of your file containing the MAGREE macro>";
Following this statement, you may call the MAGREE macro. See the Results tab for an example.
The following parameters are required:
The following parameters are optional:
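For example, a basic call might look like the following sketch. The example data, the variable names, and the data= parameter name are assumptions for illustration; items=, raters=, and response= identify the item, rater, and rating variables.

/* Hypothetical example data: four items, each rated once by three raters on a 1-3 scale */
data ratings;
   input subject rater score @@;
   datalines;
1 1 1  1 2 1  1 3 2
2 1 2  2 2 2  2 3 2
3 1 3  3 2 2  3 3 3
4 1 1  4 2 1  4 3 1
;
run;

/* A minimal sketch of a call; the data= parameter name is assumed */
%magree(data=ratings, items=subject, raters=rater, response=score)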
Fleiss' kappa
Fleiss' kappa statistic is provided for the multirater case. This statistic is available for nominal responses (or ordinal responses treated as nominal), which may have numeric or character values. Raters may rate each item no more than once. If all items are rated by all raters exactly once, estimates of kappa (overall, and for each response category when the percategory option is specified) are provided along with tests and confidence intervals. If some raters do not rate some items, kappa estimates and confidence intervals (Vanbelle, 2018) are still available, but tests cannot be provided. As noted by Fleiss et al. (1979) concerning the multiple-rater kappa statistic, the raters for a given item "are not necessarily the same as those rating another."
Note that Fleiss' statistic does not reduce to Cohen's kappa statistic (as provided by the AGREE option in PROC FREQ) for the case of only two raters. As described by Gwet (2018), this is because the Fleiss statistic is really an extension of Scott's π statistic rather than of Cohen's kappa.
Landis and Koch (1977) suggested the following qualitative interpretations for various ranges of kappa. However, the practical meaning of any given value of kappa is likely to depend on the area of study. The maximum value of kappa is one, indicating complete agreement; kappa can be negative when observed agreement is less than would be expected by chance.
Kappa value | Degree of agreement |
≤ 0 | Poor |
0 - 0.2 | Slight |
0.2 - 0.4 | Fair |
0.4 - 0.6 | Moderate |
0.6 - 0.8 | Substantial |
0.8 - 1 | Almost perfect |
Unlike the AGREE option in PROC FREQ for the two-rater case, this macro does not provide a weighted kappa for the multirater case.
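As a sketch, a call requesting only Fleiss' kappa with per-category results might look like this. The stat= value kappa and the options=percategory syntax are assumptions; check the macro header for the exact parameter names and values.

/* A minimal sketch: Fleiss' kappa with per-category kappas.  The stat= value
   kappa and the options=percategory syntax are assumptions. */
%magree(data=ratings, items=subject, raters=rater, response=score,
        stat=kappa, options=percategory)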
Gwet's AC1 and AC2
Gwet's agreement coefficients, AC1 and AC2, are available for nominal (unordered) and ordinal responses. For a nominal response, AC1 can be used to estimate agreement among the raters. When the response is ordinal and a weight matrix is provided (see Weights below), interrater agreement is estimated by the AC2 statistic. The maximum value of these agreement coefficients is one, indicating perfect agreement; the value can be negative when observed agreement is less than would be expected by chance.
Gwet developed three different standard error estimates and these are provided along with a test and confidence interval for each. The first is the standard error when raters are considered fixed and items are sampled. This is useful when you are interested in estimating or testing agreement among just the raters that were observed. The second is the standard error when items are considered fixed and raters are sampled. Interest is then in rater agreement involving only the observed items. Finally, the third standard error is fully unconditional and treats both raters and items as sampled. This allows agreement to be assessed over the entire populations of raters and items. By default, a large-sample test and confidence interval are provided. If the percategory option is specified, these statistics are also provided for each response category.
Unlike Fleiss' kappa, tests and confidence intervals based on all three standard errors are available whether all items are rated by all raters or not. Items can have unequal numbers of ratings. However, the fixed-items and unconditional standard errors require at least three raters.
When there are only two raters, the AC1 statistic is the same as that computed by the AGREE(AC1) option in PROC FREQ in SAS 9.4 TS1M4 or later. However, the standard error estimates differ because different estimation methods are used. Gwet (2008) provides the estimator for the fixed-rater variance. A jackknife approach is used to estimate the fixed-items variance. The unconditional variance is the sum of the two conditional variances.
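For an ordinal response, a sketch of a call requesting Gwet's AC2 with power-2 (quadratic) weights might look like the following. The stat= value gwet is an assumption; supplying a power value such as 2 in weight= is described under Weights below.

/* A minimal sketch: Gwet's AC2 using the power-2 weight matrix described in
   the Weights section.  The stat= value gwet is an assumption. */
%magree(data=ratings, items=subject, raters=rater, response=score,
        stat=gwet, weight=2)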
GLMM-based κm and κma
Nelson and Edwards' (2015, 2018) statistics based on a generalized linear mixed model (GLMM) are available for ordinal response data. SAS/STAT® and SAS/IML® are required. Raters are not required to rate all items. The authors note that, unlike some other measures, their statistics are invariant to the prevalence of the condition measured by the ordinal response. The measures, and inferences based on them, apply to the rater and item populations from which the observed raters and items are randomly sampled. Raters are assumed to rate each item independently. A cumulative probit GLMM with random effects for raters and items is fitted, assuming independent and normally distributed random effects. The measures are defined using the rater and item variance estimates from the model. Exact agreement among the raters is measured by κm. Partial agreement or association, in which rater differences of varying size contribute partially to the measure as defined by a weight matrix (see Weights below), is assessed by the κma statistic. Both statistics range from zero to one, with values near zero indicating low agreement and values near one indicating strong agreement (after correcting for chance).
The time needed to fit the GLMM increases with the numbers of raters and items and might become infeasible when either is large.
The macro provides estimates of both measures along with confidence limits. With this model-based approach, estimates of the random effects for raters and items (ratereffects and itemeffects options) are available, which you can use to investigate differences among the individual raters or items. A large positive random effect for a rater indicates that the rater tends to rate items higher than other raters; a large negative effect indicates a tendency to rate items lower. Similarly for items, a large effect (positive or negative) suggests that the item tends to be rated toward one end (higher or lower) of the response scale.
Also displayed is an estimate of ρ (rho), which measures the proportion of the total variability in the ratings attributable to the items and ranges between 0 and 1. A value of ρ near one indicates that most of the variability is due to item variability. High item variability suggests that the items are relatively distinct and easier for raters to distinguish. Nelson and Edwards suggest that other agreement measures might be liberal (overly large) in this situation. A value of ρ near zero indicates that most of the variability is among the raters rather than the items. When the variability among the items is relatively small, the items may be less distinct, making it difficult for raters to distinguish them.
See the cited papers for details and the results of simulation studies exploring the performance of these statistics.
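A sketch of a call requesting the GLMM-based measures, including the rater and item random-effect estimates, might look like the following. The stat= value glmm and the options= syntax for ratereffects and itemeffects are assumptions; weight=2 requests the power-2 weight matrix for κma as described under Weights below.

/* A minimal sketch: GLMM-based kappa-m and kappa-ma with rater and item
   random-effect estimates.  The stat= value glmm and the options= syntax
   are assumptions. */
%magree(data=ratings, items=subject, raters=rater, response=score,
        stat=glmm, weight=2, options=ratereffects itemeffects)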
Weights
You can specify weight= to request a computed weight matrix, or to supply your own matrix, for use in the computation of Gwet's AC2 and/or the GLMM-based κma.
The values in the weight matrix can be thought of as the amount of agreement for differences of various sizes among the ordered response categories. Typically, a difference of one category (such as a rating of 1 vs 2) is considered less serious (or more in agreement) than a difference of more than one category (such as 1 vs 5). Consequently, values close to the main diagonal (where 1 represents complete agreement) of the weight matrix typically are large and then diminish to zero at the far, off-diagonal corners. Note that the weight matrix should be symmetric, must be square with the number of rows and columns equal to the number of response categories, must have all diagonal elements equal to 1, and must contain no negative values. The unweighted AC1 statistic, used when treating the response categories as nominal, can be thought of as considering all differences of any size as equally serious.
Predefined weight matrices – linear, quadratic, square root, and identity – are available, or you can provide your own. With the identity matrix, the AC2 results are the same as the unweighted AC1 results. Alternatively, you can provide a power value for defining those matrices as well as matrices based on a wider range of powers. For example, weight=1.5 defines a weight matrix between linear and quadratic. When a weight matrix is specified and the response is character valued, the character values in sorted order are assigned the equally spaced numeric values 1, 2, 3, and so on. For κma, the values 1, 2, 3, ... are always used whether the response is character or numeric. When the response variable is numeric, the response variable values are used in the computation of the weights for the AC2 statistic. This allows the weights to reflect categories that are not equally spaced. To treat the categories as equally spaced, assign equally spaced values to the categories, or assign character values to the categories. For categories with values j and k, the weights are computed as 1 - [|j - k| / (max - min)]^p, where max and min are the maximum and minimum category values and p is the numeric value specified in weight=.
To provide more control over the distribution of weights across the category pairs, the weights can be computed from quantiles of an exponential distribution. By specifying the parameter of the exponential distribution in wtparm= (weight= must also be specified), the values from the above weight formula are used as probabilities for computing quantiles of an exponential distribution with the specified parameter, and these quantiles become the weights. Increasing values of the parameter, usually in the range 0.1 to 2, tend to reduce the weights, particularly toward the off-diagonal corners of the weight matrix.
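As an illustration of the weight formula above, the following DATA step computes the power-based weights for five equally spaced categories with p = 1.5. This is only a sketch of the formula, not code taken from the macro.

/* A sketch of the weight formula 1 - [|j-k|/(max-min)]**p for five equally
   spaced categories (1-5) and power p = 1.5.  Not macro code. */
data weights;
   p = 1.5;  min = 1;  max = 5;
   do j = min to max;
      do k = min to max;
         weight = 1 - (abs(j - k) / (max - min))**p;
         output;
      end;
   end;
   keep j k weight;
run;

proc print data=weights noobs;
run;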
When Gwet statistics are requested and the percategory option is specified to assess agreement on each category, the response is dichotomized such that all categories other than the one in question are combined. As a result, ordinality no longer applies and weights are not used. The percategory option is not available with GLMM-based statistics.
Kendall's W
Kendall's coefficient of concordance, W, is a nonparametric statistic related to Friedman's test and tests agreement of the raters' rankings of the items. It is available when the response is ordinal and numerically coded. SAS/STAT® is required. Since the macro ranks the raters' values, the response can be continuous. Mid-ranks are used when ties occur. All raters must rate all items exactly once. A large-sample test is provided by default. The jackknife-based estimate of the standard deviation of W (Grothe and Schmid, 2011), and confidence limits using this estimate, are provided with the wcl option. Grothe and Schmid (2011) discuss coverage probabilities of this interval based on their simulation results. They indicate that this estimate improves with increasing number of items and is useful when the number of items exceeds 10 or 20. Since the response may be continuous and not have a fixed set of discrete levels, agreement for individual categories is not possible.
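A sketch of a call requesting Kendall's W with its jackknife-based confidence limits might look like this. The stat= value kendall and the options=wcl syntax are assumptions; as noted above, all raters must rate all items exactly once.

/* A minimal sketch: Kendall's W with jackknife-based confidence limits.
   The stat= value kendall and the options=wcl syntax are assumptions. */
%magree(data=ratings, items=subject, raters=rater, response=score,
        stat=kendall, options=wcl)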
While the MAGREE macro does not directly support BY group processing, this capability can be provided by the RunBY macro which can run the MAGREE macro repeatedly for each of the BY groups in your data. See the RunBY macro documentation for details on its use. Also see the example titled "BY group processing" in the Results tab above.
Output data sets
Values of all statistics, their standard errors, confidence intervals, and p-values are available in data sets after execution of the macro. The kappa estimates and related statistics are in the data set _KAPPAS. Kendall's coefficient and related statistics are in the data set _CONCORD. Gwet's agreement coefficients and related statistics are saved in the data set _GAC. The GLMM-based statistics are saved in the data set _GLMM. If the ratereffects option is specified, the results are in the data set _GLMMVI. If the itemeffects option is specified, the results are in the data set _GLMMUI.
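For example, after a run that produced kappa statistics, the results can be examined or saved like any other SAS data set. The _KAPPAS name is created by the macro; the library and data set names used for saving below are hypothetical.

/* Examine the kappa results and save them to a permanent library (mylib is hypothetical) */
proc print data=_kappas noobs;
run;

data mylib.kappa_results;
   set _kappas;
run;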
Fleiss, J.L. (1981), Statistical Methods for Rates and Proportions, 2nd ed., New York: John Wiley & Sons, Inc.
Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979), "Large Sample Variance of Kappa in the Case of Different Sets of Raters," Psychological Bulletin, 86(5), 974-977.
Grothe, O. and Schmid, F. (2011), "Kendall's W Reconsidered," Communications in Statistics – Simulation and Computation, 40, 285-305.
Gwet, K.L. (2008), "Computing inter-rater reliability and its variance in the presence of high agreement," British Journal of Mathematical and Statistical Psychology, 61, 29-48.
Kendall, M.G. (1955), Rank Correlation Methods, Second Edition, London: Charles Griffin & Co. Ltd.
Landis, J.R. and Koch G.G. (1977), "The measurement of observer agreement for categorical data," Biometrics, 33, 159-174.
Nelson, K.P. and Edwards, D. (2015), "Measures of agreement between many raters for ordinal classifications," Statistics in Medicine, 34(23), 3116-3132.
Nelson, K.P. and Edwards, D. (2018), "A measure of association for ordered categorical data in population-based studies," Statistical Methods in Medical Research, 27(3), 812-831.
Vanbelle, S. (2018), "Asymptotic variability of (multilevel) multirater kappa coefficients," Statistical Methods in Medical Research.
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.