
Sample 25006: Compute estimates and tests of agreement among multiple raters


Contents: Purpose / History / Requirements / Usage / Details / Limitations / Missing Values / See Also / References

 

NOTE: For assessing agreement between two raters, use the AGREE option in PROC FREQ.
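For example, with two raters whose ratings of each item are stored in separate variables (a minimal sketch; the data set and variable names are hypothetical):

   proc freq data=TwoRaters;
      tables rater1*rater2 / agree;   /* simple and weighted kappa for two raters */
   run;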

 

PURPOSE:
Compute estimates and tests of agreement among multiple raters when responses (ratings) are on a nominal or ordinal scale. For a nominal response, kappa and Gwet's AC1 agreement coefficient are available. For an ordinal response, Gwet's weighted agreement coefficient (AC2) and statistics based on a cumulative probit model with random effects for raters and items are available. If the response is numerically coded (and possibly continuous), Kendall's coefficient of concordance is also available.
HISTORY:
The version of the MAGREE macro that you are using is displayed when you specify anything as the first macro argument. For example:
    %magree(v)

The MAGREE macro always attempts to check for a later version of itself. If it cannot do so (such as when no active internet connection is available), the macro issues the following message:

    NOTE: Unable to check for newer version of MAGREE macro.

The computations performed by the macro are not affected by the appearance of this message. You can avoid the check by specifying nochk as the first macro argument, which is useful on machines without internet access.
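For example, the following call skips the version check (a sketch; the data set and variable names are hypothetical, and the keyword parameters are described under USAGE below):

    %magree(nochk, data=ratings, items=item, raters=rater, response=score)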

Version  Update Notes
3.91     Added check for nonpositive definite matrix for GLMM statistics. Fixed sign on item and rater effects. nochk can be specified as the first (version) parameter.
3.9      Improved error checking and messaging.
3.8      Added check for zero variance in Gwet statistics.
3.7      Added nominal and ordinal as valid values in stat=.
3.6      Added GLMM-based κm and κma (Nelson and Edwards, 2015, 2018). Changed stat= to accept a list. Added itemeffects and ratereffects options.
3.3      Added wcl option.
3.2      Added percategory option. Category kappas are no longer displayed by default.
3.1      Suppressed display of fixed-items and unconditional rows in Gwet results with fewer than three raters. Added (no)fixrateronly option.
3.0      Added Gwet's agreement coefficients. Added options noprint, noweights, and counts. Former nobaltab option changed to nosummary.
2.1      Fixed a minor problem when the name in items=, raters=, or response= differs in case from the name in the data.
2.0      Confidence intervals are provided for kappa (Vanbelle, 2018). Added alpha= option. Allowed computation of kappa and confidence intervals (but not tests) with an unequal number of ratings per item. Added optional table of rating counts across items.
1.3      Fixed errors caused by using the name COUNT or PERCENT for the items=, raters=, or response= variable.
1.2      Corrected cases of perfect agreement. Required more than one item and rater. Added check for new version; reset _last_ data set and notes option at end.
1.1      Added error checking for parameters, data set, and variables. Added version option.
REQUIREMENTS:
Only Base SAS® is required for kappa and Gwet's statistics. SAS/STAT® is required for the Kendall and GLMM-based statistics. SAS/IML® is required for GLMM-based statistics.
USAGE:
Follow the instructions in the Downloads tab of this sample to save the MAGREE macro definition. Replace the text within quotation marks in the following statement with the location of the MAGREE macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the MAGREE macro and make it available for use:
   %inc "<location of your file containing the MAGREE macro>";

Following this statement, you may call the MAGREE macro. See the Results tab for an example.
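As an additional minimal sketch here (the file location, data set, and variable names are hypothetical), the following creates a data set with one observation per rating and requests all applicable statistics:

   %inc "C:\mymacros\magree.sas";

   data ratings;
      input item rater score;   /* one observation per rating */
      datalines;
   1 1 2
   1 2 2
   1 3 3
   2 1 1
   2 2 1
   2 3 2
   3 1 3
   3 2 3
   3 3 3
   ;

   %magree(data=ratings, items=item, raters=rater, response=score)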

The following parameters are required:

items=variable
Specifies the variable containing item (or subject) identifiers. This variable may be numeric or character. There must be more than one item.
raters=variable
Specifies the variable containing rater identifiers. This variable may be numeric or character. The same rater values must be used for all items even if items are rated by different raters. Note that Kendall's statistic is only valid when all items are rated by the same raters. There must be more than one rater.
response=variable
Specifies the variable containing the ratings. This variable may be numeric or character. If it is character, it is assumed to be nominal and only kappa and Gwet's statistics are computed. If it is numeric, any of the statistics can be computed, but note that the Kendall and GLMM-based statistics are only valid for ordinal responses. Any format assigned to the response variable is ignored.

The following parameters are optional:

data=data-set-name
Identifies the input data set to be analyzed. If not specified, the last-created data set is used. It should contain one observation per rating – that is, each observation should contain one rating on one item by one rater with variables containing item and rater identifiers and a variable containing the rating (response). The same set of rater values must be used for each item, even if items are rated by different raters (which is only valid for kappa and Gwet statistics).
stat=all | nominal | ordinal | statistics
Specifies the statistics to calculate. Specify nominal to request all relevant statistics for a nominal-scale response, specify ordinal to request all relevant statistics for an ordinal-scale response, or specify one or more particular statistics separated by spaces. The statistics that you can specify include the following. Specify kappa to compute Fleiss' kappa statistics for a nominal response. Specify gwet to compute Gwet's AC1 statistic for a nominal response, or Gwet's AC2 statistic for an ordinal response if weight= is also specified. Specify kendall to compute Kendall's coefficient of concordance for an ordinal or continuous response. Specify glmm to compute the GLMM-based κm and κma statistics for an ordinal response. Note that the Kendall and GLMM-based statistics are not computed if the response= variable is a character variable. Use all to compute all statistics (if possible). By default, stat=all.
alpha=value
Specifies the alpha level for confidence intervals with confidence level equal to 1-value. Value must be between 0 and 1. The default is alpha=0.05.
weight=linear | quadratic | sqrt | id | value | data-set-name
Specifies the structure of a symmetric weight matrix to be computed for use in Gwet's AC2 statistic and the GLMM-based κma statistic for ordinal responses. The Kendall and kappa statistics are not computed when weight= is specified. See Weights below for details. Value, if specified, must be between 0.01 and 5. Specifying weight=linear is equivalent to weight=1, weight=quadratic is equivalent to weight=2, and weight=sqrt is equivalent to weight=0.5. When weight=data-set-name, the specified data set must have exactly k observations and k numeric variables containing the weight matrix, where k is the number of response categories. Any names can be used for the variables in the data set. weight= is ignored unless gwet, glmm, or all is specified in stat=. If weight= is omitted, Gwet's AC1 is computed when the Gwet statistic is requested, and κm is computed when the GLMM-based statistics are requested.
wtparm=value
Specifies the parameter of the exponential distribution that can optionally be used in the computation of a symmetric weight matrix for Gwet's AC2 statistic or the GLMM-based κma statistic for ordinal responses. See Weights below for details. If specified, value must be greater than or equal to 0.01. wtparm= is ignored unless weight= is specified and stat= includes gwet, glmm, or all. It is also ignored if a data set name is specified in weight=.
options=options
One or more of the following options can be specified, separated by spaces. A combined example call appears after this list.
counts
Displays a table showing the number of ratings in each category for each item.
noprint
Suppresses all displayed results. See Output data sets below.
nosummary
Suppresses display of the table showing the number of items that have specific numbers of ratings.
noweights
Suppresses display of the weight matrix, if specified.
t
Tests and confidence limits for Gwet's statistics and confidence limits for Kendall's W are computed using the t distribution. By default, or with option z, large-sample values are computed using the standard normal distribution.
z
Tests and confidence limits for Gwet's statistics and confidence limits for Kendall's W are computed using the standard normal distribution. This is the default.
fixrateronly
Suppresses computation of the fixed-items and unconditional variances and related Gwet statistics. This avoids the iterative jackknife method that becomes more time consuming as the number of items increases.
percategory
For Fleiss' kappa and Gwet's agreement coefficients, provides statistics for assessing agreement on each response category. Category-level agreement does not apply to the Kendall and GLMM-based statistics.
wcl
Provides a jackknife-based confidence interval for Kendall's coefficient of concordance, W.
itemeffects
When GLMM-based statistics are requested, displays estimated random effects for each item.
ratereffects
When GLMM-based statistics are requested, displays estimated random effects for each rater.
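For example, the following sketch (hypothetical data set and variable names) requests per-category statistics, the table of rating counts, and t-based tests and confidence limits:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=kappa gwet, options=percategory counts t)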
DETAILS:
The methodology implemented by the MAGREE macro is presented in Vanbelle (2018), Fleiss (1981, 2003), Gwet (2008; agreestat.com), Fleiss et al. (1979), Kendall (1955), and Nelson and Edwards (2015, 2018). See these references for details. For the kappa and Gwet statistics, the MAGREE macro assumes that the response rating scale is categorical and that each possible value of the scale is observed at least once. That is, if the raters rate the items on a scale with k levels, but only k-p levels appear in the data, then these agreement statistics assess rater agreement on a scale with only k-p levels.

Fleiss' kappa

Fleiss' kappa statistic is provided for the multirater case. It is available for nominal responses (or ordinal responses to be treated as nominal), which may have numeric or character values. Raters may rate each item no more than once. If all items are rated by all raters exactly once, estimates of kappa (overall, and for each response category when the percategory option is specified) are provided along with tests and confidence intervals. If some raters do not rate some items, kappa estimates and confidence intervals (Vanbelle, 2018) are still available, but tests cannot be provided. As noted by Fleiss et al. (1979) concerning the multiple-rater kappa statistic, the raters for a given item "are not necessarily the same as those rating another."

Note that Fleiss' statistic does not reduce to Cohen's kappa statistic (as provided by the AGREE option in PROC FREQ) for the case of only two raters. As described by Gwet (2018), this is because the Fleiss statistic is really an extension of Scott's π statistic rather than of Cohen's kappa.

Landis and Koch (1977) attempted to qualify the degree of agreement that exists when kappa falls in various ranges. However, the practical meaning of any given value of kappa is likely to depend on the area of study. Note that the maximum value of kappa is one, indicating complete agreement. Kappa can be negative when agreement is less than expected by chance.

Kappa value   Degree of agreement
≤ 0           Poor
0 - 0.2       Slight
0.2 - 0.4     Fair
0.4 - 0.6     Moderate
0.6 - 0.8     Substantial
0.8 - 1       Almost perfect

Unlike the AGREE option in PROC FREQ for the two-rater case, this macro does not provide a weighted kappa for the multirater case.

Gwet's AC1 and AC2

Gwet's agreement coefficients, AC1 and AC2, are available for nominal (unordered) and ordinal responses. For a nominal response, AC1 can be used to estimate agreement among the raters. When the response is ordinal and a weight matrix is provided (see Weights below), interrater agreement is estimated by the AC2 statistic. The maximum value of these agreement coefficients is one, indicating perfect agreement. The value can be negative when agreement is less than expected by chance.

Gwet developed three different standard error estimates and these are provided along with a test and confidence interval for each. The first is the standard error when raters are considered fixed and items are sampled. This is useful when you are interested in estimating or testing agreement among just the raters that were observed. The second is the standard error when items are considered fixed and raters are sampled. Interest is then in rater agreement involving only the observed items. Finally, the third standard error is fully unconditional and treats both raters and items as sampled. This allows agreement to be assessed over the entire populations of raters and items. By default, a large-sample test and confidence interval are provided. If the percategory option is specified, these statistics are also provided for each response category.

Unlike Fleiss' kappa, tests and confidence intervals based on all three standard errors are available whether all items are rated by all raters or not. Items can have unequal numbers of ratings. However, the fixed-items and unconditional standard errors require at least three raters.

When there are only two raters, the AC1 statistic is the same as that computed by the AGREE(AC1) option in PROC FREQ in SAS 9.4 TS1M4 or later. However, the standard error estimates differ because different estimation methods are used. Gwet (2008) provides the estimator for the fixed-rater variance. A jackknife approach is used to estimate the fixed-items variance. The unconditional variance is the sum of the two conditional variances.

GLMM-based κm and κma

Nelson and Edwards' (2015, 2018) statistics, based on a generalized linear mixed model (GLMM), are available for ordinal response data. SAS/STAT® and SAS/IML® are required. Raters are not required to rate all items. Nelson and Edwards note that, unlike some other measures, their statistics are invariant to the prevalence of the condition measured by the ordinal response. The measures, and inferences based on them, apply to the rater and item populations from which the observed raters and items are randomly selected. Raters are assumed to rate each item independently. A cumulative probit GLMM with random effects for raters and items is fitted assuming independent and normally distributed random effects. The measures are defined using the rater and item variance estimates from the model. Exact agreement among the raters is measured by κm. Partial agreement or association, in which rater differences of varying size contribute partially to the measure as defined by a weight matrix (see Weights below), is assessed by the κma statistic. Both statistics range from zero to one, with values near zero indicating low agreement and values near one indicating strong agreement (after correcting for chance).

The time needed to fit the GLMM increases with the numbers of raters and items and might become infeasible when either is large.

The macro provides estimates of both measures along with confidence limits. With this model-based approach, estimates of the random effects for raters and items (ratereffects and itemeffects options) are available which you can use to investigate differences among the individual raters or items. A large positive random effect for a rater indicates that the rater tends to rate items higher than other raters. A large negative effect indicates a tendency to rate items lower than other raters. Similarly for items, a large effect (positive or negative) suggests that the item is rated more toward one end (higher or lower) of the response scale.

Also displayed is an estimate of ρ (rho), which essentially measures the proportion of the total variability in the ratings that is attributable to the items and ranges between 0 and 1. A value of ρ near one indicates that most of the variability is due to item variability. High item variability suggests that the items are relatively distinct and easier for raters to distinguish. Nelson and Edwards suggest that other agreement measures might be liberal (overly large) in this situation. A value of ρ near zero indicates that most of the variability is among the raters rather than the items. When the variability among the items is relatively small, the items might not be as distinct, making it difficult for raters to distinguish them.

See the cited papers for details and the results of simulation studies exploring the performance of these statistics.
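The following sketch (hypothetical data set and variable names) requests the GLMM-based statistics along with the estimated random effects for items and raters:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=glmm, options=itemeffects ratereffects)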

Weights

You can specify weight= to request use of a computed weight matrix or to specify your own matrix to be used in the computation of Gwet's AC2 and/or the GLMM-based κma.

The values in the weight matrix can be thought of as the amount of agreement for differences of various sizes among the ordered response categories. Typically, a difference of one category (such as a rating of 1 vs 2) is considered less serious (or more in agreement) than a difference of more than one category (such as 1 vs 5). Consequently, values close to the main diagonal of the weight matrix (where 1 represents complete agreement) are typically large and diminish to zero at the far, off-diagonal corners. The weight matrix must be square, with the number of rows and columns equal to the number of response categories; it should be symmetric, must have all diagonal elements equal to 1, and must contain no negative values. The unweighted AC1 statistic, used when treating the response categories as nominal, can be thought of as considering all differences of any size as equally serious.

Predefined weight matrices – linear, quadratic, square root, and identity – are available, or you can provide your own. With the identity matrix, the AC2 results are the same as the unweighted AC1 results. Alternatively, you can provide a power value for defining those matrices as well as matrices based on a wider range of powers. For example, weight=1.5 defines a weight matrix between linear and quadratic. When a weight matrix is specified and the response is character valued, the character values in sorted order are assigned the equally spaced numeric values 1, 2, 3, and so on. For κma, the values 1, 2, 3, ... are always used whether the response is character or numeric. When the response variable is numeric, the response variable values are used in the computation of the weights for the AC2 statistic. This allows the weights to reflect categories that are not equally spaced. To treat the categories as equally spaced, assign equally spaced numeric values to the categories, or assign character values to the categories. For categories with values j and k, the weights are computed as 1 - [|j-k|/(max-min)]^p, where max and min are the maximum and minimum category values and p is the numeric value specified in weight=.
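As an illustration of this formula, the following DATA step sketch builds the k-observation, k-variable data set that weight= accepts, here for five equally spaced categories coded 1 through 5 (the data set name wmat and the choice p=1 for linear weights are hypothetical):

   data wmat(keep=w1-w5);
      array w{5};
      p = 1;                            /* power: 1=linear, 2=quadratic, 0.5=sqrt */
      do i = 1 to 5;
         do j = 1 to 5;                 /* 1 - [|i-j|/(max-min)]**p with max-min = 4 */
            w{j} = 1 - (abs(i-j)/4)**p;
         end;
         output;                        /* one row of the weight matrix per response category */
      end;
   run;

The resulting data set could then be supplied to the macro as weight=wmat.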

To provide more control over the distribution of weights across the category pairs, the weights can be computed based on quantiles of an exponential distribution. When you specify the parameter of the exponential distribution in wtparm= (weight= must also be specified), the values from the above weight formula are used as probabilities for computing quantiles from an exponential distribution with the specified parameter. These quantiles become the weights. Increasing values of the parameter, usually in the range 0.1 to 2, tend to reduce the weights, particularly toward the off-diagonal corners of the weight matrix.

When Gwet statistics are requested and the percategory option is specified to assess agreement on each category, the response is dichotomized such that all categories other than the one in question are combined. As a result, ordinality no longer applies and weights are not used. The percategory option is not available with GLMM-based statistics.

Kendall's W

Kendall's coefficient of concordance, W, is a nonparametric statistic related to Friedman's test and tests agreement of the raters' rankings of the items. It is available when the response is ordinal and numerically coded. SAS/STAT® is required. Because the macro ranks the raters' values, the response can be continuous. Mid-ranks are used when ties occur. All raters must rate all items exactly once. A large-sample test is provided by default. The jackknife-based estimate of the standard deviation of W (Grothe and Schmid, 2011), and confidence limits using this estimate, are provided with the wcl option. Grothe and Schmid (2011) discuss the coverage probabilities of this interval based on their simulation results. They indicate that the estimate improves with an increasing number of items and is useful when the number of items exceeds 10 or 20. Because the response may be continuous and not have a fixed set of discrete levels, agreement for individual categories is not possible.
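A sketch (hypothetical data set and variable names) requesting Kendall's W with the jackknife-based confidence interval:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=kendall, options=wcl)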

BY group processing

While the MAGREE macro does not directly support BY group processing, this capability can be provided by the RunBY macro which can run the MAGREE macro repeatedly for each of the BY groups in your data. See the RunBY macro documentation for details on its use. Also see the example titled "BY group processing" in the Results tab above.

Output data sets

Values of all statistics, their standard errors, confidence intervals, and p-values are available in data sets after execution of the macro. The kappa estimates and related statistics are in the data set _KAPPAS. Kendall's coefficient and related statistics are in the data set _CONCORD. Gwet's agreement coefficients and related statistics are saved in the data set _GAC. The GLMM-based statistics are saved in the data set _GLMM. If the ratereffects option is specified, the results are in the data set _GLMMVI. If the itemeffects option is specified, the results are in the data set _GLMMUI.
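For example, the following sketch (hypothetical data set and variable names) suppresses all displayed output and then prints the saved Gwet statistics:

   %magree(data=ratings, items=item, raters=rater, response=score,
           stat=gwet, options=noprint)

   proc print data=_gac;
   run;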

LIMITATIONS:
There must be more than one rater and more than one item to compute any of the statistics. Also, no statistics are computed if any rater rates any item more than once. The set of values in the raters= variable must be the same for all items, even if different raters are used for different items (which is only valid for the kappa and Gwet statistics). See MISSING VALUES below. Kappa, Gwet, and GLMM-based statistics require that each item have zero or one rating from each rater. Items may have unequal numbers of ratings, but in that case only confidence intervals, not tests, are available for kappa. However, in the case of unequal numbers of ratings per item, both confidence intervals and tests are available for Gwet's statistics. Kendall's statistic requires that the response variable be numeric and ordinal and that all items have exactly one rating from all raters.
MISSING VALUES:
Observations with missing values in the response=, items=, or raters= variable are removed before the analysis. Error messages are issued and no statistics are computed if any rater rates an item more than once. If some raters do not rate some items, the Kendall statistic and the tests for the kappa statistics are not computed. However, the GLMM-based statistics and confidence intervals for Fleiss' kappa are provided in this case. While Gwet's statistics are provided even if some raters do not rate some items, at least three raters are required to estimate the fixed-items and unconditional standard errors.
SEE ALSO:
The FREQ procedure in Base SAS software can compute simple and weighted kappa statistics for comparing two raters. Beginning in SAS 9.4 TS1M4, it can also provide Gwet's AC1 statistic for the two-rater case. Use the AGREE or AGREE(AC1) option in the TABLES statement.
REFERENCES:
Fleiss, J.L., Levin, B., and Paik, M.C. (2003), Statistical Methods for Rates and Proportions, 3rd ed., Hoboken, NJ: John Wiley & Sons, Inc.

Fleiss, J.L. (1981), Statistical Methods for Rates and Proportions, 2nd ed., New York: John Wiley & Sons, Inc.

Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979), "Large Sample Variance of Kappa in the Case of Different Sets of Raters," Psychological Bulletin, 86(5), 974-977.

Grothe, O. and Schmid, F. (2011), "Kendall's W Reconsidered," Communications in Statistics – Simulation and Computation, 40, 285-305.

Gwet, K.L. (2008), "Computing inter-rater reliability and its variance in the presence of high agreement," British Journal of Mathematical and Statistical Psychology, 61, 29-48.

Kendall, M.G. (1955), Rank Correlation Methods, 2nd ed., London: Charles Griffin & Co. Ltd.

Landis, J.R. and Koch, G.G. (1977), "The measurement of observer agreement for categorical data," Biometrics, 33, 159-174.

Nelson, K.P. and Edwards, D. (2015), "Measures of agreement between many raters for ordinal classifications," Statistics in Medicine, 34(23), 3116-3132.

Nelson, K.P. and Edwards, D. (2018), "A measure of association for ordered categorical data in population-based studies," Statistical Methods in Medical Research, 27(3), 812-831.

Vanbelle, S. (2018), "Asymptotic variability of (multilevel) multirater kappa coefficients," Statistical Methods in Medical Research.




These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.