Contents: Purpose / History / Requirements / Usage / Details / Limitations / Missing Values / See Also / References

**NOTE: For assessing agreement between two raters, use the AGREE option in PROC FREQ.**

*PURPOSE:* Compute estimates and tests of agreement among multiple raters when responses (ratings) are on a nominal or ordinal scale. For a nominal response, kappa and Gwet's AC_{1} agreement coefficient are available. For an ordinal response, Gwet's weighted agreement coefficient (AC_{2}) and statistics based on a cumulative probit model with random effects for raters and items are available. If the response is numerically coded (and possibly continuous), Kendall's coefficient of concordance is also available.

*HISTORY:* The version of the MAGREE macro that you are using is displayed in the log when you specify **version** (or any string) as the first argument. For example:

%magree(version, <*macro options*>)

The MAGREE macro always attempts to check for a later version of itself. If it is unable to do this (such as when no active Internet connection is available), the macro issues the following message:

NOTE: Unable to check for newer version

The computations performed by the macro are not affected by the appearance of this message.

*Version / Update Notes*

3.7  Added **nominal** and **ordinal** as valid values in **stat=**.
3.6  Added GLMM-based κ_{m} and κ_{ma} (Nelson and Edwards, 2015, 2018). Changed **stat=** to accept a list. Added **itemeffects** and **ratereffects** options.
3.3  Added **wcl** option.
3.2  Added **percategory** option. Category kappas are no longer displayed by default.
3.1  Suppressed display of the fixed-items and unconditional rows in Gwet results with fewer than three raters. Added **(no)fixrateronly** option.
3.0  Added Gwet's agreement coefficients. Added options **noprint**, **noweights**, and **counts**. Former **nobaltab** option changed to **nosummary**.
2.1  Fixed minor problem when the name in **items=**, **raters=**, or **response=** differs in case from the name in the data.
2.0  Confidence intervals are provided for kappa (Vanbelle, 2018). Added **alpha=** option. Allow computation of kappa and confidence intervals (but not tests) with an unequal number of ratings per item. Added optional table of rating counts across items.
1.3  Fixed errors caused by using the name COUNT or PERCENT for the **items=**, **raters=**, or **response=** variable.
1.2  Corrected cases of perfect agreement. Require more than one item and rater. Added check for new version, reset of _last_ data set and notes option at end.
1.1  Added error checking for parameters, data set, and variables. Added **version** option.

*REQUIREMENTS:* Only Base SAS^{®} is required for kappa and Gwet's statistics. SAS/STAT^{®} is required for the Kendall and GLMM-based statistics. SAS/IML^{®} is required for the GLMM-based statistics.

*USAGE:* Follow the instructions in the Downloads tab of this sample to save the MAGREE macro definition. Replace the text within quotation marks in the following statement with the location of the MAGREE macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the MAGREE macro and make it available for use:
%inc "<location of your file containing the MAGREE macro>";

Following this statement, you may call the MAGREE macro. See the Results tab for an example.

The following parameters are required:

**items=***variable* - Specifies the variable containing item (or subject) identifiers. This variable may be numeric or character. There must be more than one item.
**raters=***variable* - Specifies the variable containing rater identifiers. This variable may be numeric or character. The same rater values must be used for all items, even if items are rated by different raters. Note that Kendall's statistic is valid only when all items are rated by the same raters. There must be more than one rater.
**response=***variable* - Specifies the variable containing the ratings. This variable may be numeric or character. If it is character, it is assumed to be nominal, and only kappa and Gwet's statistics are computed. If it is numeric, any of the statistics can be computed, but note that the Kendall and GLMM-based statistics are valid only for ordinal responses. Any format assigned to the response variable is ignored.

The following parameters are optional:

**data=***data-set-name* - Identifies the input data set to be analyzed. If not specified, the last-created data set is used. It should contain one observation per rating – that is, each observation should contain one rating on one item by one rater, with variables containing the item and rater identifiers and a variable containing the rating (response). The same set of rater values must be used for each item, even if items are rated by different raters (which is valid only for the kappa and Gwet statistics).
**stat=nominal | ordinal |** *statistics* - Specifies the statistics to calculate. Specify **nominal** to request all relevant statistics for a nominal-scale response, specify **ordinal** to request all relevant statistics for an ordinal-scale response, or specify one or more particular statistics separated by spaces. The statistics that you can specify include the following. Specify **kappa** to compute Fleiss' kappa statistics for a nominal response. Specify **gwet** to compute Gwet's AC_{1} statistic for a nominal response, or Gwet's AC_{2} statistic for an ordinal response if **weight=** is also specified. Specify **kendall** to compute Kendall's coefficient of concordance for an ordinal or continuous response. Specify **glmm** to compute the GLMM-based κ_{m} and κ_{ma} statistics for an ordinal response. Note that the Kendall and GLMM-based statistics are not computed if the **response=** variable is a character variable. Use **all** to compute all statistics (if possible). By default, **stat=all**.
**alpha=***value* - Specifies the alpha level for confidence intervals with confidence level equal to 1-*value*. *Value* must be between 0 and 1. The default is **alpha=0.05**.
**weight=linear | quadratic | sqrt | id |** *value* | *data-set-name* - Specifies the structure of a symmetric weight matrix to be computed for use in Gwet's AC_{2} statistic and the GLMM-based κ_{ma} statistic for ordinal responses. The Kendall and kappa statistics are not computed when **weight=** is specified. See **Weights** below for details. *Value*, if specified, must be between 0.01 and 5. Specifying **weight=linear** is equivalent to **weight=1**; similarly, **weight=quadratic** is equivalent to **weight=2**, and **weight=sqrt** is equivalent to **weight=0.5**. When **weight=***data-set-name*, the specified data set must have exactly *k* observations and *k* numeric variables containing the weight matrix, where *k* is the number of response categories. Any names can be used for the variables in the data set. **weight=** is ignored unless **gwet**, **glmm**, or **all** is specified in **stat=**. If omitted, Gwet's AC_{1} is computed when the Gwet statistic is requested, and κ_{m} is computed when the GLMM-based statistics are requested.
**wtparm=***value* - Specifies the parameter of the exponential distribution that can optionally be used in the computation of a symmetric weight matrix for Gwet's AC_{2} statistic or the GLMM-based κ_{ma} statistic for ordinal responses. See **Weights** below for details. If specified, *value* must be greater than or equal to 0.01. Ignored unless **weight=** and either **gwet**, **glmm**, or **all** is specified in **stat=**. Also ignored if a data set name is specified in **weight=**.
**options=***options* - One or more of the following options can be specified, separated by spaces.
- counts
- Displays a table showing the number of ratings in each category for each item.
- noprint
- Suppresses all displayed results. See **Output data sets** below.
- nosummary
- Suppresses display of the table showing the number of items that have specific numbers of ratings.
- noweights
- Suppresses display of the weight matrix, if specified.
- t
- Tests and confidence limits for Gwet's statistics and confidence limits for Kendall's W are computed using the *t* distribution. By default, or with option **z**, large-sample values are computed using the standard normal distribution.
- z
- Tests and confidence limits for Gwet's statistics and confidence limits for Kendall's W are computed using the standard normal distribution. This is the default.
- fixrateronly
- Suppresses computation of the fixed-items and unconditional variances and related Gwet statistics. This avoids the iterative jackknife method that becomes more time consuming as the number of items increases.
- percategory
- For Fleiss' kappa and Gwet's agreement coefficients, provides statistics for assessing agreement on each response category. Category-level agreement does not apply to the Kendall and GLMM-based statistics.
- wcl
- Provides a jackknife-based confidence interval for Kendall's coefficient of concordance, W.
- itemeffects
- When GLMM-based statistics are requested, displays estimated random effects for each item.
- ratereffects
- When GLMM-based statistics are requested, displays estimated random effects for each rater.

*DETAILS:* The methodology implemented by the MAGREE macro is presented in Vanbelle (2018), Fleiss (1981, 2003), Gwet (2008 and agreestat.com), Fleiss et al. (1979), Kendall (1955), and Nelson and Edwards (2015, 2018). See these references for details. For the kappa and Gwet statistics, the MAGREE macro assumes that the response rating scale is categorical and that each possible value of the scale is observed at least once. That is, if the raters rate the items on a scale with *k* levels, but only *k-p* levels appear in the data, then these agreement statistics are appropriate for assessing rater agreement on a scale with only *k-p* levels.

**Fleiss' kappa**

Fleiss' kappa statistic is provided for the multirater case. This statistic is available for nominal responses (or ordinal responses to be treated as nominal), and the response may have numeric or character values. Raters may rate each item no more than once. If all items are rated by all raters exactly once, estimates of kappa (overall, and for each response category when the **percategory** option is specified) are provided along with tests and confidence intervals. If some raters do not rate some items, kappa estimates and confidence intervals (Vanbelle, 2018) are still available, but tests cannot be provided. As noted by Fleiss et al. (1979) concerning the multiple-rater kappa statistic, the raters for a given item "are not necessarily the same as those rating another."

Note that Fleiss' statistic does not reduce to Cohen's kappa statistic (as provided by the AGREE option in PROC FREQ) for the case of only two raters. As described by Gwet (2018), this is because the Fleiss statistic is really an extension of Scott's π statistic rather than of Cohen's kappa.
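To see the distinction concretely, here is a small Python sketch (illustrative only, using hypothetical ratings; it is not part of the macro). With two raters, the multirater statistic reduces to Scott's π, whose chance term squares the pooled (averaged) marginals, while Cohen's kappa multiplies each rater's own marginals.

```python
# Two raters, four items: Fleiss/Scott vs Cohen chance correction.
from collections import Counter

r1 = ["A", "A", "B", "B"]   # hypothetical ratings, rater 1
r2 = ["A", "B", "B", "B"]   # hypothetical ratings, rater 2
n = len(r1)

p_obs = sum(a == b for a, b in zip(r1, r2)) / n   # observed agreement

cats = sorted(set(r1) | set(r2))
m1, m2 = Counter(r1), Counter(r2)

# Cohen: chance term multiplies each rater's own marginal proportions
pe_cohen = sum((m1[c] / n) * (m2[c] / n) for c in cats)
kappa_cohen = (p_obs - pe_cohen) / (1 - pe_cohen)

# Scott/Fleiss: chance term squares the pooled (averaged) marginals
pe_scott = sum(((m1[c] + m2[c]) / (2 * n)) ** 2 for c in cats)
pi_scott = (p_obs - pe_scott) / (1 - pe_scott)

print(kappa_cohen, pi_scott)   # the two statistics differ
```

With these data, Cohen's kappa is 0.5 while Scott's π (and hence the two-rater Fleiss statistic) is smaller, because pooling the marginals inflates the chance-agreement term.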

Landis and Koch (1977) attempted to qualify the degree of agreement that exists when kappa is found to be in various ranges. However, the practical meaning of any given value of kappa is likely to depend on the area of study. Note that the maximum value of kappa is one, indicating complete agreement. Kappa can be negative when there is no agreement.

Kappa value    Degree of agreement
≤ 0            Poor
0 - 0.2        Slight
0.2 - 0.4      Fair
0.4 - 0.6      Moderate
0.6 - 0.8      Substantial
0.8 - 1        Almost perfect

Unlike the AGREE option in PROC FREQ for the two-rater case, weighted kappa is not available in this macro for the multirater case.
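The kappa point estimate itself is straightforward in the balanced case (every item rated by the same number of raters). The following Python sketch (illustrative only, not part of the macro) reproduces the overall kappa for the item-by-category count table analyzed in Example 1 below; the macro's standard errors, tests, and confidence intervals are omitted.

```python
# Rows are items, columns are response categories, entries are counts
# of ratings per item (from the Fleiss example used later in this doc).
counts = [
    [1, 4, 0], [2, 0, 3], [0, 0, 5], [4, 0, 1], [3, 0, 2],
    [1, 4, 0], [5, 0, 0], [0, 4, 1], [1, 0, 4], [3, 0, 2],
]
n = sum(counts[0])    # ratings per item (here 5)
N = len(counts)       # number of items

# Observed agreement: average over items of the pairwise agreement rate
P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
            for row in counts) / N

# Chance agreement: squared overall category proportions, summed
q = len(counts[0])
p = [sum(row[j] for row in counts) / (N * n) for j in range(q)]
P_e = sum(pj * pj for pj in p)

kappa = (P_bar - P_e) / (1 - P_e)
print(round(kappa, 5))   # 0.41789, matching the macro's overall kappa
```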

**Gwet's AC**_{1} **and AC**_{2}

Gwet's agreement coefficients, AC_{1} and AC_{2}, are available for nominal (unordered) and ordinal responses. For a nominal response, AC_{1} can be used to estimate agreement among the raters. When the response is ordinal and a weight matrix is provided (see **Weights** below), interrater agreement is estimated by the AC_{2} statistic. The maximum value of these agreement coefficients is one, indicating perfect agreement. The value can be negative when there is no agreement.

Gwet developed three different standard error estimates, and these are provided along with a test and confidence interval for each. The first is the standard error when raters are considered fixed and items are sampled. This is useful when you are interested in estimating or testing agreement among just the raters that were observed. The second is the standard error when items are considered fixed and raters are sampled. Interest is then in rater agreement involving only the observed items. Finally, the third standard error is fully unconditional and treats both raters and items as sampled. This allows agreement to be assessed over the entire populations of raters and items. By default, a large-sample test and confidence interval are provided. If the **percategory** option is specified, these statistics are also provided for each response category.

Unlike Fleiss' kappa, tests and confidence intervals based on all three standard errors are available whether or not all items are rated by all raters. Items can have unequal numbers of ratings. However, the fixed-items and unconditional standard errors require at least three raters.
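For the balanced nominal case, the AC_{1} point estimate can be sketched in Python as well (illustrative only; the three standard errors and tests are beyond this sketch). Using the item-by-category counts from Example 1 below reproduces the AC_{1} value reported there.

```python
# AC1 point estimate (Gwet, 2008), balanced nominal case.
counts = [
    [1, 4, 0], [2, 0, 3], [0, 0, 5], [4, 0, 1], [3, 0, 2],
    [1, 4, 0], [5, 0, 0], [0, 4, 1], [1, 0, 4], [3, 0, 2],
]
n = sum(counts[0])     # raters per item
N = len(counts)        # items
q = len(counts[0])     # response categories

# Observed agreement: same pairwise form as in Fleiss' kappa
p_a = sum((sum(c * c for c in row) - n) / (n * (n - 1))
          for row in counts) / N

# Chance agreement: based on pi_j(1 - pi_j), scaled by 1/(q - 1)
pi = [sum(row[j] for row in counts) / (N * n) for j in range(q)]
p_e = sum(pj * (1 - pj) for pj in pi) / (q - 1)

ac1 = (p_a - p_e) / (1 - p_e)
print(round(ac1, 5))   # 0.43587, matching the macro's AC1 estimate
```

Note how AC_{1}'s chance term, based on π_j(1 − π_j), differs from the squared-proportion chance term of Fleiss' kappa; this is why the two estimates differ for the same data.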

When there are only two raters, the AC_{1} statistic is the same as that computed by the AGREE(AC1) option in PROC FREQ in SAS 9.4 TS1M4 or later. However, the standard error estimates differ because different estimation methods are used. Gwet (2008) provides the estimator for the fixed-rater variance. A jackknife approach is used to estimate the fixed-items variance. The unconditional variance is the sum of the two conditional variances.

**GLMM-based κ**_{m} **and κ**_{ma}

Nelson and Edwards' (2015, 2018) statistics based on a generalized linear mixed model (GLMM) are available for ordinal response data. SAS/STAT^{®} and SAS/IML^{®} are required. Raters are not required to rate all items. Nelson and Edwards note that, unlike some other measures, their statistics are invariant to the prevalence of the condition measured by the ordinal response. The measures, and inferences based on them, apply to the rater and item populations from which the observed raters and items are randomly selected. Raters are assumed to rate each item independently. A cumulative probit GLMM with random effects for raters and items is fitted assuming independent and normally distributed random effects. The measures are defined using the rater and item variance estimates from the model. Exact agreement among the raters is measured by κ_{m}. Partial agreement, or association, in which rater differences of varying size contribute partially to the measure as defined by a weight matrix (see **Weights** below), is assessed by the κ_{ma} statistic. Both statistics range from zero to one, with values near zero indicating low agreement and values near one indicating strong agreement (after correcting for chance). The macro provides estimates of both measures along with confidence limits.

With this model-based approach, estimates of the random effects for raters and items (**ratereffects** and **itemeffects** options) are available, which you can use to investigate departures from agreement for individual raters or items. A rater with a large positive random effect tends to rate items higher than other raters; a large negative effect indicates a tendency to rate lower than others. Similarly for items, a large effect (positive or negative) suggests that the item is more clearly located at one end of the scale, while an effect near zero indicates less clarity.

Also displayed is an estimate of ρ (rho), which essentially measures the proportion of the total variability in the ratings attributable to the items and ranges between 0 and 1. A value of ρ near one indicates that most of the variability is due to item variability. High item variability suggests that the items are relatively distinct and easier for raters to distinguish. Nelson and Edwards suggest that other agreement measures might be liberal (overly large) in this situation. A value of ρ near zero indicates that most variability is among the raters rather than the items. When the variability among the items is relatively small, items may not be as distinct, making it difficult for raters to distinguish them.

See the cited papers for details and the results of simulation studies exploring the performance of these statistics.

**Weights**

You can specify **weight=** to request use of a computed weight matrix, or to specify your own matrix, to be used in the computation of Gwet's AC_{2} and/or the GLMM-based κ_{ma}. The values in the weight matrix can be thought of as the amount of agreement for differences of various sizes among the ordered response categories. Typically, a difference of one category (such as a rating of 1 vs 2) is considered less serious (or more in agreement) than a difference of more than one category (such as 1 vs 5). Consequently, values close to the main diagonal (where 1 represents complete agreement) of the weight matrix typically are large and then diminish to zero at the far, off-diagonal corners. Note that the weight matrix should be symmetric, must be square with the number of rows and columns equal to the number of response categories, must have all diagonal elements equal to 1, and must contain no negative values. The unweighted AC_{1} statistic, used when treating the response categories as nominal, can be thought of as considering all differences of any size as equally serious.

Predefined weight matrices – linear, quadratic, square root, and identity – are available, or you can provide your own. With the identity matrix, the AC_{2} results are the same as the unweighted AC_{1} results. Alternatively, you can provide a power value for defining those matrices as well as matrices based on a wider range of powers. For example, **weight=1.5** defines a weight matrix between linear and quadratic. When a weight matrix is specified and the response is character valued, the character values in sorted order are assigned the equally spaced numeric values 1, 2, 3, and so on. For κ_{ma}, the values 1, 2, 3, ... are always used whether the response is character or numeric. When the response variable is numeric, the response variable values are used in the computation of the weights for the AC_{2} statistic. This allows the weights to reflect categories that are not equally spaced. To treat the categories as equally spaced, assign values to the categories that are equally spaced, or assign character values to the categories. For categories with values j and k, weights are computed as 1-[|j-k|/(max-min)]^{p}, where max and min are the maximum and minimum category values and p is the numeric value specified in **weight=**.

To provide more control over the distribution of weights across the category pairs, the weights can be computed based on quantiles of an exponential distribution. By specifying the parameter of the exponential distribution in **wtparm=** (**weight=** must also be specified), the values from the above weight formula are used as probabilities for computing quantiles from an exponential distribution with the specified parameter. These quantiles become the weights. Increasing the value of the parameter, usually in the range 0.1 to 2, tends to reduce the weights, particularly toward the off-diagonal corners of the weight matrix.

When Gwet statistics are requested and the

**percategory** option is specified to assess agreement on each category, the response is dichotomized such that all categories other than the one in question are combined. As a result, ordinality no longer applies and weights are not used. The **percategory** option is not available with GLMM-based statistics.

**Kendall's W**

Kendall's coefficient of concordance, W, is a nonparametric statistic related to Friedman's test, and it tests agreement of the raters' rankings of the items. It is available when the response is ordinal and numerically coded. SAS/STAT^{®} is required. Since the macro ranks the raters' values, the response can be continuous. Mid-ranks are used when ties occur. All raters must rate all items exactly once. A large-sample test is provided by default. The jackknife-based estimate of the standard deviation of W (Grothe and Schmid, 2011), and confidence limits using this estimate, are provided with the **wcl** option. Grothe and Schmid (2011) discuss coverage probabilities of this interval based on their simulation results. They indicate that this estimate improves with an increasing number of items and is useful when the number of items exceeds 10 or 20. Since the response may be continuous and not have a fixed set of discrete levels, agreement for individual categories is not possible.

**Output data sets**

Values of all statistics, their standard errors, confidence intervals, and *p*-values are available in data sets after execution of the macro. The kappa estimates and related statistics are in the data set _KAPPAS. Kendall's coefficient and related statistics are in the data set _CONCORD. Gwet's agreement coefficients and related statistics are saved in the data set _GAC. The GLMM-based statistics are saved in the data set _GLMM. If the **ratereffects** option is specified, the results are in the data set _GLMMVI. If the **itemeffects** option is specified, the results are in the data set _GLMMUI.

*LIMITATIONS:* There must be more than one rater and more than one item. The set of values in the
**raters=** variable must be the same for all items, even if different raters are used for different items (which is valid only for the kappa and Gwet statistics). See MISSING VALUES below. The kappa, Gwet, and GLMM-based statistics require that each item have zero or one rating from each rater. Items may have unequal numbers of ratings, but in that case only confidence intervals, not tests, are available for kappa. However, in the case of unequal numbers of ratings per item, both confidence intervals and tests are available for Gwet's statistics. Kendall's statistic requires that the response variable be numeric and ordinal and that all items have one rating from all raters.

*MISSING VALUES:* Observations with missing values in the **response=** variable are removed before the analysis. Error messages are issued and no statistics are computed if any rater rates an item more than once. If some raters do not rate some items, the Kendall statistic and tests for the kappa statistics are not computed. However, the GLMM-based statistics and confidence intervals for Fleiss' kappa are provided in this case. While Gwet's statistics are provided even if some raters do not rate some items, at least three raters are required to estimate the fixed-items and unconditional standard errors.

*SEE ALSO:* The FREQ procedure in Base SAS software can compute simple and weighted kappa statistics for comparing two raters. Beginning in SAS 9.4 TS1M4, it can also provide Gwet's AC_{1} statistic for the two-rater case. Use the AGREE or AGREE(AC1) option in the TABLES statement.

*REFERENCES:*

Fleiss, J.L., Levin, B., and Paik, M.C. (2003),
*Statistical Methods for Rates and Proportions*, 3rd ed., Hoboken, NJ: John Wiley & Sons, Inc.

Fleiss, J.L. (1981), *Statistical Methods for Rates and Proportions*, 2nd ed., New York: John Wiley & Sons, Inc.

Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979), "Large Sample Variance of Kappa in the Case of Different Sets of Raters," *Psychological Bulletin*, 86(5), 974-977.

Grothe, O. and Schmid, F. (2011), "Kendall's *W* Reconsidered," *Communications in Statistics – Simulation and Computation*, 40, 285-305.

Gwet, K.L. (2008), "Computing inter-rater reliability and its variance in the presence of high agreement," *British Journal of Mathematical and Statistical Psychology*, 61, 29-48.

Kendall, M.G. (1955), *Rank Correlation Methods*, 2nd ed., London: Charles Griffin & Co. Ltd.

Landis, J.R. and Koch, G.G. (1977), "The measurement of observer agreement for categorical data," *Biometrics*, 33, 159-174.

Nelson, K.P. and Edwards, D. (2015), "Measures of agreement between many raters for ordinal classifications," *Statistics in Medicine*, 34(23), 3116-3132.

Nelson, K.P. and Edwards, D. (2018), "A measure of association for ordered categorical data in population-based studies," *Statistical Methods in Medical Research*, 27(3), 812-831.

Vanbelle, S. (2018), "Asymptotic variability of (multilevel) multirater kappa coefficients," *Statistical Methods in Medical Research*.

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.


**Example 1: Estimating and testing rater agreement**

This example is from Fleiss et al. (2003, Table 18.8). Ten subjects (S) are each rated into one of three categories (Y) by each of five raters (R). The following DATA step creates a data set with one observation per rating. The response, Y, is numeric, allowing it to be treated as nominal or ordinal so that all statistics – Fleiss' kappa, Gwet's agreement coefficient, the GLMM-based statistics, and Kendall's coefficient of concordance – can be computed.
title "Analysis of data from Fleiss (2003)";
data ratings;
  do s=1 to 10;
    do r=1 to 5;
      input y @@;
      output;
    end;
  end;
  datalines;
1 2 2 2 2
1 1 3 3 3
3 3 3 3 3
1 1 1 1 3
1 1 1 3 3
1 2 2 2 2
1 1 1 1 1
2 2 2 2 3
1 3 3 3 3
1 1 1 3 3
;

The following displays the data for the first two subjects showing the form of the data needed by the MAGREE macro with one observation per rating.

proc print data=ratings;
  where s in (1,2);
  id s r;
  title2 'Required form of input data set';
run;

Analysis of data from Fleiss (2003)
Required form of input data set

 s   r   y
 1   1   1
 1   2   2
 1   3   2
 1   4   2
 1   5   2
 2   1   1
 2   2   1
 2   3   3
 2   4   3
 2   5   3

**Nominal response**

The following statements use the

**%inc** statement to define the MAGREE macro in your SAS session. The macro needs to be defined only once per SAS session. The macro is then called to compute the agreement statistics. **stat=nominal** is specified to compute all statistics appropriate for a nominal response, which includes the kappa statistic and Gwet's AC_{1}. The **counts** option is included to display the table showing, for each subject, the number of ratings in each response category.

%inc "<*location of your file containing the MAGREE macro*>";
%magree(data=ratings, items=s, raters=r, response=y, stat=nominal,
        options=counts)

The "Ratings Summary" table shows the number of subjects that have specific numbers of ratings. This table details any departure from balance. Balance occurs when all subjects have equal numbers of ratings. These data are balanced: all ten subjects are rated once by each of the five raters, so each subject has five ratings. The "Rating counts per item" table reproduces the table shown in Fleiss et al. (2003).

Analysis of data from Fleiss (2003)
The MAGREE macro
Ratings Summary

 Ratings
  Count    Items
      5       10

Rating counts per item

The FREQ Procedure

Frequency    Table of s by y

            y
 s          1     2     3   Total
 1          1     4     0       5
 2          2     0     3       5
 3          0     0     5       5
 4          4     0     1       5
 5          3     0     2       5
 6          1     4     0       5
 7          5     0     0       5
 8          0     4     1       5
 9          1     0     4       5
 10         3     0     2       5
 Total     20    12    18      50

Following these data summary tables, the table of Fleiss' kappa statistics appears. The overall kappa estimate, 0.418, is significantly different from zero (*p*<0.0001). A 95% large-sample confidence interval for the overall kappa is (0.214, 0.621).

Analysis of data from Fleiss (2003)
The MAGREE macro
Kappa statistics for nominal response

 Y        Kappa     Std Error|H0   Z         Prob>|Z|   Std Error   Lower CL   Upper CL
 Overall  0.41789   0.071653      5.83220    <.0001     0.10383     0.21439    0.62139

The table of Gwet agreement statistics is given next. Since the response is considered nominal, no weight matrix was specified. As a result, the AC_{1} statistic is produced. The AC_{1} estimate is 0.436. If the raters are considered fixed, so that inference is limited to the observed set of raters but subjects are considered randomly sampled from an infinite population, then AC_{1} is significantly different from zero (*p*<0.0001) with a 95% confidence interval of (0.230, 0.642). Alternatively, if the raters are considered randomly sampled from an infinite population and subjects are considered fixed, then AC_{1} is significantly different from zero (*p*=0.0280) with confidence interval (0.047, 0.825). If both raters and subjects are samples from infinite populations, then AC_{1} is marginally significant (*p*=0.0522) with confidence interval (-0.004, 0.876). Use the results appropriate for your study design.

Analysis of data from Fleiss (2003)
The MAGREE macro
Gwet's Agreement Coefficient

 Fixed    AC1       Std Error   Z Value   Pr > |Z|   Lower CL   Upper CL
 Raters   0.43587   0.10511     4.14687   <.0001      0.22986    0.64187
 Items    0.43587   0.19836     2.19732   0.0280      0.04708    0.82465
          0.43587   0.22449     1.94159   0.0522     -0.00412    0.87586

**Ordinal response**

If the response is considered ordinal, then Gwet's AC_{2}, the GLMM-based statistics κ_{m} and κ_{ma}, and Kendall's coefficient of concordance, W, can be used; these take the ordering into account when assessing the agreement among the raters. AC_{2} and W also use the spacing of the categories. Fleiss' kappa and/or Gwet's AC_{1} statistic could also be used, but they do not take the ordinal nature of the response into account, effectively treating it as nominal.

In the following macro calls, **stat=ordinal** is specified to compute all statistics appropriate for an ordinal response. The **nosummary** option is specified, and the **counts** option is removed, to avoid repeating the tables shown above. Kendall's W and κ_{m} for ordinal data are provided when **weight=** is not specified, since they do not make use of weights. Also, the **wcl** option is included to provide a jackknife-based estimate of the standard deviation of Kendall's W and a 95% confidence interval. Specify **alpha=** if a different confidence level is desired.

%magree(data=ratings, items=s, raters=r, response=y, stat=ordinal,
        options=nosummary wcl)
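Kendall's W can be cross-checked outside the macro. The following Python sketch (illustrative only; the macro additionally provides the F test and the jackknife **wcl** confidence limits) applies the usual tie-corrected formula with mid-ranks to the ratings data above.

```python
# Kendall's W with mid-ranks for ties.
# Rows = subjects s1..s10, columns = raters r1..r5 (Fleiss example data).
ratings = [
    [1, 2, 2, 2, 2], [1, 1, 3, 3, 3], [3, 3, 3, 3, 3], [1, 1, 1, 1, 3],
    [1, 1, 1, 3, 3], [1, 2, 2, 2, 2], [1, 1, 1, 1, 1], [2, 2, 2, 2, 3],
    [1, 3, 3, 3, 3], [1, 1, 1, 3, 3],
]
n = len(ratings)       # items (subjects)
m = len(ratings[0])    # raters

def midranks(values):
    """Rank values 1..n, assigning average (mid) ranks to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Rank each rater's column across items; accumulate the tie correction T
rank_cols, T = [], 0.0
for r in range(m):
    col = [ratings[s][r] for s in range(n)]
    rank_cols.append(midranks(col))
    sizes = {v: col.count(v) for v in set(col)}
    T += sum(t**3 - t for t in sizes.values())

R = [sum(rank_cols[r][s] for r in range(m)) for s in range(n)]  # rank sums
S = sum((Ri - m * (n + 1) / 2) ** 2 for Ri in R)
W = 12 * S / (m * m * (n**3 - n) - m * T)
print(round(W, 5))   # 0.49058, matching the macro's estimate
```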

Assuming an ordinal response, the GLMM-based κ_{m} estimate of exact agreement is 0.214 with a 95% confidence interval of (0.063, 0.365). The estimate of Kendall's W is 0.491 with 95% confidence interval (0.191, 0.790).

The MAGREE macro
GLMM-based ordinal measure Kappa_m

 Kappa_m   Std Error   Lower CL   Upper CL   Rho       Rho Std Error
 0.21408   0.077221    0.062731   0.36543    0.45979   0.13653

Kendall's Coefficient of Concordance for ordinal response

 Coefficient of
 Concordance     F       Num DF   Den DF   Prob>F   Std Error   Lower CL   Upper CL
 0.49058         3.852   8.6      34.4     0.0021   0.15299     0.19073    0.79044

To compute Gwet's AC_{2} and κ_{ma}, a weight matrix is specified to allow partial agreement to contribute. A linear weight matrix is specified with **weight=linear**, which prevents Kendall's W and κ_{m} from being computed.

%magree(data=ratings, items=s, raters=r, response=y, stat=ordinal,
        weight=linear, options=nosummary)

A table of Gwet's AC_{2} statistics now appears following display of the linear weight matrix that was requested. The values in this matrix indicate the amount of partial agreement that is considered to exist for each possible disagreement in rating. The 0.5 value indicates partial agreement for a difference of one rating level. However, there is no partial agreement for a difference of two levels.

Using these weights, the AC_{2} estimate is 0.298. As for the AC_{1} results above, a test and confidence interval are provided depending on whether raters, subjects, or neither are considered fixed. The table showing the GLMM-based κ_{ma} is displayed next. Its estimated value is 0.304 with confidence interval (0.112, 0.496). The value of ρ is a moderate 0.460, suggesting that neither source of variability (raters or items) is dominant.

The MAGREE macro
Weight matrix for Gwet's AC_{2} and/or GLMM Kappa_{ma}

 w1    w2    w3
 1.0   0.5   0.0
 0.5   1.0   0.5
 0.0   0.5   1.0

**Gwet's Agreement Coefficient**

| Fixed | AC2 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Raters | 0.29825 | 0.15287 | 1.95096 | 0.0511 | -0.00138 | 0.59787 |
| Items | 0.29825 | 0.21150 | 1.41013 | 0.1585 | -0.11629 | 0.71278 |
|  | 0.29825 | 0.26096 | 1.14286 | 0.2531 | -0.21324 | 0.80973 |

**Weighted GLMM-based ordinal measure Kappa_{ma}**

| Kappa_ma | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.30415 | 0.097874 | 0.11232 | 0.49598 | 0.45979 | 0.13653 |

**Example 2: Assessing individual rater and item effects**

The κ_{m} and κ_{ma} statistics are based on a mixed model which includes random effects for the raters and for the items. Estimates of these random effects can be used to further investigate the effects of individual raters and/or items on departures from agreement.

Assuming that the response in the above data is ordinal, the following macro call requests the κ_{m} statistic and both the **ratereffects** and **itemeffects** options to display tables of the individual effects. Note that these effects are the same whether they are then used to compute κ_{m} for exact agreement or κ_{ma} for partial agreement (association).

%magree(data=ratings, items=s, raters=r, response=y, stat=glmm, options=ratereffects itemeffects)

The GLMM-based measure of exact agreement is κ_{m}=0.214 with confidence interval (0.063, 0.365), suggesting a small amount of exact agreement. Contrast this with the slightly stronger measure of partial agreement (κ_{ma}=0.304) using linear weights shown above.

Starting with the final table of rater effects, note that raters 1 and 5 are the most divergent from the others, though neither differs significantly from zero. The positive value for rater 1 suggests that rater might tend to rate items more toward the high end of the scale, while the negative value for rater 5 suggests that rater tends more toward the lower end. Similarly, in the item effects table, the effect of subject 3 is significantly negative (*p*=0.047), suggesting that subject belongs at the lower end of the scale. Subject 7, however, has a marginally significant positive effect (*p*=0.057), indicating a higher position on the scale.

**GLMM-based ordinal measure Kappa_{m}**

| Kappa_m | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.21408 | 0.077221 | 0.062731 | 0.36543 | 0.45979 | 0.13653 |

**Estimated Item (S) random effects**

| Subject | Estimate | Std Err Pred | DF | t Value | Pr > \|t\| | Alpha | Lower | Upper |
|---|---|---|---|---|---|---|---|---|
| s 1 | 0.2599 | 0.6218 | 35 | 0.42 | 0.6785 | 0.05 | -1.0024 | 1.5222 |
| s 2 | -0.4084 | 0.6638 | 35 | -0.62 | 0.5424 | 0.05 | -1.7559 | 0.9391 |
| s 3 | -2.0934 | 1.0165 | 35 | -2.06 | 0.0470 | 0.05 | -4.1570 | -0.02981 |
| s 4 | 0.9941 | 0.7248 | 35 | 1.37 | 0.1789 | 0.05 | -0.4773 | 2.4656 |
| s 5 | 0.2664 | 0.6619 | 35 | 0.40 | 0.6898 | 0.05 | -1.0773 | 1.6101 |
| s 6 | 0.2599 | 0.6218 | 35 | 0.42 | 0.6785 | 0.05 | -1.0024 | 1.5222 |
| s 7 | 1.9624 | 0.9962 | 35 | 1.97 | 0.0568 | 0.05 | -0.06006 | 3.9849 |
| s 8 | -0.4058 | 0.6241 | 35 | -0.65 | 0.5198 | 0.05 | -1.6728 | 0.8612 |
| s 9 | -1.1363 | 0.7377 | 35 | -1.54 | 0.1325 | 0.05 | -2.6338 | 0.3612 |
| s 10 | 0.2664 | 0.6619 | 35 | 0.40 | 0.6898 | 0.05 | -1.0773 | 1.6101 |

**Estimated Rater (R) random effects**

| Subject | Estimate | Std Err Pred | DF | t Value | Pr > \|t\| | Alpha | Lower | Upper |
|---|---|---|---|---|---|---|---|---|
| r 1 | 1.2621 | 0.6707 | 35 | 1.88 | 0.0682 | 0.05 | -0.09952 | 2.6238 |
| r 2 | 0.4039 | 0.5730 | 35 | 0.70 | 0.4855 | 0.05 | -0.7592 | 1.5671 |
| r 3 | 0.07079 | 0.5651 | 35 | 0.13 | 0.9010 | 0.05 | -1.0765 | 1.2180 |
| r 4 | -0.5890 | 0.5826 | 35 | -1.01 | 0.3189 | 0.05 | -1.7717 | 0.5937 |
| r 5 | -1.1684 | 0.6423 | 35 | -1.82 | 0.0775 | 0.05 | -2.4725 | 0.1356 |

**Example 3: Per category agreement**

Agreement of the raters on each response category can be assessed by specifying the **percategory** option. The following macro call provides kappa and AC_{1} estimates for each category as well as overall. The Kendall and GLMM-based results (if requested specifically or with **stat=all**) are not affected by this option. The estimates for a category can be obtained by dichotomizing the response such that all categories other than the one in question are combined.

%magree(data=ratings, items=s, raters=r, response=y, stat=nominal, options=percategory)
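The dichotomization idea can be sketched directly. Below is a minimal, illustrative implementation of Fleiss' kappa on a table of per-item category counts, plus the collapse of all other categories used for a per-category estimate (the function names are hypothetical and this is not the macro's internal code):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j.
    Assumes every item received the same number of ratings."""
    n = len(counts)                 # items
    m = sum(counts[0])              # ratings per item
    # Observed agreement: chance that two ratings of the same item match
    p_obs = sum((sum(c * c for c in row) - m) / (m * (m - 1))
                for row in counts) / n
    # Chance agreement from the overall category proportions
    props = [sum(row[j] for row in counts) / (n * m)
             for j in range(len(counts[0]))]
    p_exp = sum(p * p for p in props)
    return (p_obs - p_exp) / (1 - p_exp)

def dichotomize(counts, cat):
    """Collapse all categories other than `cat` into a single category."""
    m = sum(counts[0])
    return [[row[cat], m - row[cat]] for row in counts]

# Two categories, perfect agreement, items split between them -> kappa = 1
counts = [[3, 0], [3, 0], [0, 3], [0, 3]]
print(fleiss_kappa(counts))  # -> 1.0
```

Running `fleiss_kappa(dichotomize(counts, c))` for each category c is the per-category computation the **percategory** option describes.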

The statistics for the individual response categories indicate the degree of agreement on each category. In this example, agreement is strongest on category 2.

**Kappa statistics for nominal response**

| y | Kappa | Standard Error \|H0 | Z | Prob>\|Z\| | Standard Error | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|---|
| 1 | 0.29167 | 0.10000 | 2.91667 | 0.0018 | 0.15546 | -0.01303 | 0.59636 |
| 2 | 0.67105 | 0.10000 | 6.71053 | <.0001 | 0.05018 | 0.57271 | 0.76940 |
| 3 | 0.34896 | 0.10000 | 3.48958 | 0.0002 | 0.17249 | 0.01089 | 0.68703 |
| Overall | 0.41789 | 0.07165 | 5.83220 | <.0001 | 0.10383 | 0.21439 | 0.62139 |

**Gwet's Agreement Coefficient**

| y | Fixed | AC1 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|---|
| 1 | Raters | 0.55263 | 0.14650 | 3.77224 | 0.0002 | 0.26550 | 0.83977 |
| 1 | Items | 0.55263 | 0.16593 | 3.33041 | 0.0009 | 0.22741 | 0.87786 |
| 1 |  | 0.55263 | 0.22135 | 2.49663 | 0.0125 | 0.11879 | 0.98647 |
| 2 | Raters | 0.85323 | 0.09996 | 8.53577 | <.0001 | 0.65731 | 1.00000 |
| 2 | Items | 0.85323 | 0.09518 | 8.96395 | <.0001 | 0.66667 | 1.00000 |
| 2 |  | 0.85323 | 0.13803 | 6.18153 | <.0001 | 0.58270 | 1.00000 |
| 3 | Raters | 0.61019 | 0.14624 | 4.17242 | <.0001 | 0.32356 | 0.89682 |
| 3 | Items | 0.61019 | 0.13142 | 4.64289 | <.0001 | 0.35260 | 0.86777 |
| 3 |  | 0.61019 | 0.19662 | 3.10339 | 0.0019 | 0.22482 | 0.99555 |
| Overall | Raters | 0.43587 | 0.10511 | 4.14687 | <.0001 | 0.22986 | 0.64187 |
| Overall | Items | 0.43587 | 0.19836 | 2.19732 | 0.0280 | 0.04708 | 0.82465 |
| Overall |  | 0.43587 | 0.22449 | 1.94159 | 0.0522 | -0.00412 | 0.87586 |

**Example 4: Unequal numbers of ratings per item**
When some raters do not rate some items, not all statistics can be computed. If any item has only a single rating, the kappa statistics cannot be produced. As long as every item is rated by two or more raters, kappa estimates and confidence intervals are provided, but tests are available only if all items are rated by all raters. Kappa statistics are also not available if any rater rates any item more than once.
Gwet statistics are likewise unavailable if any rater rates any item more than once, but estimates, tests, and confidence intervals are available even if some raters do not rate some items. Note that tests and confidence intervals for the fixed-items and unconditional study designs require at least three raters in the study data.

GLMM-based statistics are available if some raters do not rate some items, but not if any rater rates an item more than once.

Kendall's statistic cannot be provided if some raters do not rate some items, or if any rater rates an item more than once.
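These availability rules can be codified in a short helper that inspects the observed (item, rater) pairs and reports which statistics could be produced. This is a hypothetical illustration of the rules stated above, not the macro's actual logic:

```python
from collections import Counter

def available_stats(records):
    """records: one (item, rater) pair per non-missing rating.
    Illustrative codification of the availability rules; not MAGREE code."""
    per_pair = Counter(records)                     # ratings per (item, rater)
    per_item = Counter(item for item, _ in records)
    raters = {rater for _, rater in records}
    no_dups = all(c == 1 for c in per_pair.values())
    complete = no_dups and len(per_pair) == len(per_item) * len(raters)
    return {
        "kappa estimates/CIs": no_dups and min(per_item.values()) >= 2,
        "kappa tests": complete,
        "Gwet statistics": no_dups,
        "GLMM statistics": no_dups,
        "Kendall's W": complete,
    }

# The Example 4 data below (None = missing); item 6 has only two ratings
rows = [
    [1, 2, 2, 2, None], [1, 1, 3, 3, 3], [None, 3, 3, 3, 3],
    [1, 1, 1, 1, 3], [1, 1, 1, 3, 3], [1, 2, None, None, None],
    [1, 1, 1, 1, 1], [2, 2, None, None, 3], [1, 3, 3, 3, 3],
    [1, 1, 1, 3, 3],
]
records = [(s, r) for s, row in enumerate(rows)
           for r, y in enumerate(row) if y is not None]
# Kappa estimates are available; kappa tests and Kendall's W are not
print(available_stats(records))
```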

The following modification of Example 1 illustrates the results when some raters do not rate some items.

title "Analysis of Fleiss (2003) data with missings";
data missings;
  do S=1 to 10;
    do R=1 to 5;
      input Y @@;
      output;
    end;
  end;
  datalines;
1 2 2 2 .
1 1 3 3 3
. 3 3 3 3
1 1 1 1 3
1 1 1 3 3
1 2 . . .
1 1 1 1 1
2 2 . . 3
1 3 3 3 3
1 1 1 3 3
;

%magree(data=missings, items=s, raters=r, response=y, stat=all)

These messages appear in the log, indicating that the missing ratings prevent the Kendall statistics and the tests for the kappa statistics from being computed.

NOTE: Tests for kappa are only available when all items have an equal number of ratings.
NOTE: To compute the Kendall statistic, each rater must rate each item exactly once.

The multiple rows in the "Ratings Summary" table immediately indicate that some raters did not rate some items. In particular, only six items were rated by all five raters. One item had only two ratings, another had only three, and two items had four ratings.

Analysis of Fleiss (2003) data with missings

**Ratings Summary**

| Ratings Count | Items |
|---|---|
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 6 |

The results show the kappa statistics with confidence intervals, followed by the Gwet statistics and the GLMM-based statistics. Note that tests for the kappa statistics are not provided. Also, as noted above, the Kendall statistics could not be computed due to the missing ratings.

**Kappa statistics for nominal response**

| Y | Kappa | Standard Error | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|
| Overall | 0.24894 | 0.12985 | -.005554655 | 0.50344 |

**Gwet's Agreement Coefficient**

| Fixed | AC1 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Raters | 0.30176 | 0.15076 | 2.00154 | 0.0453 | 0.00627 | 0.59725 |
| Items | 0.30176 | 0.20061 | 1.50424 | 0.1325 | -0.09142 | 0.69494 |
|  | 0.30176 | 0.25094 | 1.20250 | 0.2292 | -0.19008 | 0.79360 |

**GLMM-based ordinal measure Kappa_{m}**

| Kappa_m | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.18171 | 0.079854 | 0.025201 | 0.33822 | 0.40067 | 0.15051 |

**Example 5: Ordinal data – Weight matrices for Gwet's AC_{2} and κ_{ma}**

When the response has ordered categories, any disagreement in the ratings from two raters can be small or large as measured by the difference in the category values. You might consider small differences to represent partial agreement and increasingly larger differences to represent diminishing amounts of agreement, with very large differences representing no agreement. This can be done by defining a weight matrix, which is a symmetric *k*×*k* matrix of values ranging from 1 to 0, where *k* is the number of response categories. Each element in the matrix quantifies the amount of partial agreement that a pair of ratings has. Consequently, the diagonal elements, where both ratings are the same, always equal 1. Elements off of but near the diagonal typically are assigned larger values since these elements represent smaller differences and more agreement in ratings than elements far from the diagonal. Often, the extreme corners of the matrix are assigned zero weights.

You can specify **weight=** to define a weight matrix for use in computing the AC_{2} and κ_{ma} statistics for ordinal responses. Optionally, **wtparm=** can be specified in addition to **weight=**. The Usage and Details sections of this documentation describe how the weights are computed based on the specified values. Alternatively, you can specify your own weight matrix by saving it in a data set and specifying the name of that data set in **weight=**.

The following illustrates some of the weight matrices that can be generated and the effect on the AC_{2} and κ_{ma} statistics. The following macro call computes the unweighted AC_{1} and κ_{m} statistics for a data set of 15 items and 10 raters using a 5-point ordinal response with categories valued 1, 2, 3, 4, or 5.

%magree(data=a, items=s, raters=r, response=y, stat=gwet glmm, options=counts nosummary)
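The linear, quadratic, and square root matrices shown later in this example are all consistent with power-family weights of the form w = 1 − (d/(k−1))^p, where d is the absolute difference in category values and p is the power (1, 2, or 0.5). A sketch under that assumption (the macro's exact construction is described in its Usage and Details sections):

```python
def weight_matrix(k, p):
    """Power-family agreement weights for k ordered categories:
    w[i][j] = 1 - (|i - j| / (k - 1)) ** p, so the diagonal is 1
    and the extreme corners are 0."""
    return [[1 - (abs(i - j) / (k - 1)) ** p for j in range(k)]
            for i in range(k)]

for row in weight_matrix(5, 1):   # p = 1 reproduces the linear matrix
    print(row)
```

With k=5, p=1 gives first row (1, 0.75, 0.5, 0.25, 0), p=2 gives (1, 0.9375, 0.75, 0.4375, 0), and p=0.5 gives (1, 0.5, 0.29289, 0.13397, 0), matching the matrices displayed below.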

There is noticeable agreement shown in the rating counts table. For example, nine of the ten raters agreed that item 1 belongs in category 2. For item 2, eight of ten agree on category 4, while the other two chose categories 2 and 3. The resulting unweighted AC_{1} agreement coefficient estimate is 0.536, indicating a moderate level of exact agreement, and is significantly different from zero (p<0.0001). The GLMM-based κ_{m} finds a weaker degree of exact agreement, estimated at 0.164 with confidence interval (0.085, 0.242).

**Rating counts per item** (the FREQ Procedure: frequency table of s by y)

| s | y=1 | y=2 | y=3 | y=4 | y=5 | Total |
|---|---|---|---|---|---|---|
| 1 | 0 | 9 | 0 | 1 | 0 | 10 |
| 2 | 0 | 1 | 1 | 8 | 0 | 10 |
| 3 | 0 | 8 | 0 | 2 | 0 | 10 |
| 4 | 10 | 0 | 0 | 0 | 0 | 10 |
| 5 | 0 | 9 | 0 | 0 | 1 | 10 |
| 6 | 1 | 7 | 2 | 0 | 0 | 10 |
| 7 | 1 | 0 | 8 | 0 | 1 | 10 |
| 8 | 1 | 1 | 6 | 1 | 1 | 10 |
| 9 | 0 | 0 | 2 | 8 | 0 | 10 |
| 10 | 0 | 2 | 7 | 0 | 1 | 10 |
| 11 | 0 | 2 | 7 | 1 | 0 | 10 |
| 12 | 0 | 9 | 0 | 1 | 0 | 10 |
| 13 | 1 | 1 | 0 | 8 | 0 | 10 |
| 14 | 8 | 0 | 1 | 0 | 1 | 10 |
| 15 | 1 | 6 | 2 | 1 | 0 | 10 |
| Total | 23 | 55 | 36 | 31 | 5 | 150 |
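As a cross-check, the unweighted AC_{1} can be computed directly from the counts table above. The sketch below follows Gwet's published formula (observed agreement versus a chance term built from the category prevalences); it is an illustration, not the macro's code:

```python
def gwet_ac1(counts):
    """counts[i][q] = number of the m raters placing item i in category q."""
    n = len(counts)                 # items
    k = len(counts[0])              # categories
    m = sum(counts[0])              # raters per item (assumed constant)
    # Observed agreement: chance that two ratings of the same item match
    pa = sum((sum(c * c for c in row) - m) / (m * (m - 1))
             for row in counts) / n
    # Gwet's chance-agreement term, based on category prevalences
    pi = [sum(row[q] for row in counts) / (n * m) for q in range(k)]
    pe = sum(p * (1 - p) for p in pi) / (k - 1)
    return (pa - pe) / (1 - pe)

counts = [
    [0, 9, 0, 1, 0], [0, 1, 1, 8, 0], [0, 8, 0, 2, 0], [10, 0, 0, 0, 0],
    [0, 9, 0, 0, 1], [1, 7, 2, 0, 0], [1, 0, 8, 0, 1], [1, 1, 6, 1, 1],
    [0, 0, 2, 8, 0], [0, 2, 7, 0, 1], [0, 2, 7, 1, 0], [0, 9, 0, 1, 0],
    [1, 1, 0, 8, 0], [8, 0, 1, 0, 1], [1, 6, 2, 1, 0],
]
print(round(gwet_ac1(counts), 5))  # -> 0.53638, matching the macro's AC1
```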

**Gwet's Agreement Coefficient**

| Fixed | AC1 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Raters | 0.53638 | 0.056783 | 9.44606 | <.0001 | 0.42509 | 0.64767 |
| Items | 0.53638 | 0.071871 | 7.46309 | <.0001 | 0.39552 | 0.67725 |
|  | 0.53638 | 0.091596 | 5.85594 | <.0001 | 0.35686 | 0.71591 |

**GLMM-based ordinal measure Kappa_{m}**

| Kappa_m | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.16362 | 0.040056 | 0.085110 | 0.24213 | 0.52942 | 0.091248 |

Suppose it is felt that agreement should be estimated allowing for partial agreement depending on the size of the difference in ratings. One way to do that is with a linear weight matrix, which can be requested by **weight=linear** or, equivalently, **weight=1**.

%magree(data=a, items=s, raters=r, response=y, stat=gwet glmm, weight=linear, options=nosummary)

Notice that the weight values decrease progressively to zero with increasing distance from the diagonal. The even steps in value in bands parallel to the diagonal occur because of the evenly-spaced response categories in this example. Notice also that allowing for partial agreement has increased the agreement coefficient, AC_{2}, to 0.637, and the GLMM-based κ_{ma} to 0.355.

**Weight matrix for Gwet's AC_{2} and/or GLMM Kappa_{ma}**

| w1 | w2 | w3 | w4 | w5 |
|---|---|---|---|---|
| 1.00 | 0.75 | 0.50 | 0.25 | 0.00 |
| 0.75 | 1.00 | 0.75 | 0.50 | 0.25 |
| 0.50 | 0.75 | 1.00 | 0.75 | 0.50 |
| 0.25 | 0.50 | 0.75 | 1.00 | 0.75 |
| 0.00 | 0.25 | 0.50 | 0.75 | 1.00 |

**Gwet's Agreement Coefficient**

| Fixed | AC2 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Raters | 0.63674 | 0.051262 | 12.4213 | <.0001 | 0.53627 | 0.73721 |
| Items | 0.63674 | 0.084428 | 7.5418 | <.0001 | 0.47126 | 0.80221 |
|  | 0.63674 | 0.098772 | 6.4466 | <.0001 | 0.44315 | 0.83033 |

**Weighted GLMM-based ordinal measure Kappa_{ma}**

| Kappa_ma | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.35518 | 0.068474 | 0.22097 | 0.48939 | 0.52942 | 0.091248 |

Another possible weight matrix is the quadratic matrix, which could also be requested by **weight=2**.

%magree(data=a, items=s, raters=r, response=y, stat=gwet glmm, weight=quadratic, options=nosummary)

Notice that the larger power (2 vs 1) for this matrix increases the weight values, meaning that more agreement is contributed by those differences. As a result, AC_{2} further increases to 0.727. κ_{ma} is essentially unchanged from the linear weight results.

**Weight matrix for Gwet's AC_{2} and/or GLMM Kappa_{ma}**

| w1 | w2 | w3 | w4 | w5 |
|---|---|---|---|---|
| 1.0000 | 0.9375 | 0.7500 | 0.4375 | 0.0000 |
| 0.9375 | 1.0000 | 0.9375 | 0.7500 | 0.4375 |
| 0.7500 | 0.9375 | 1.0000 | 0.9375 | 0.7500 |
| 0.4375 | 0.7500 | 0.9375 | 1.0000 | 0.9375 |
| 0.0000 | 0.4375 | 0.7500 | 0.9375 | 1.0000 |

**Gwet's Agreement Coefficient**

| Fixed | AC2 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Raters | 0.72677 | 0.06389 | 11.3756 | <.0001 | 0.60155 | 0.85198 |
| Items | 0.72677 | 0.09142 | 7.9500 | <.0001 | 0.54759 | 0.90594 |
|  | 0.72677 | 0.11153 | 6.5164 | <.0001 | 0.50817 | 0.94536 |

**Weighted GLMM-based ordinal measure Kappa_{ma}**

| Kappa_ma | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.35519 | 0.068474 | 0.22098 | 0.48940 | 0.52942 | 0.091248 |

The square root matrix, **weight=sqrt** or **weight=0.5**, produces smaller weights than the linear and quadratic matrices.

%magree(data=a, items=s, raters=r, response=y, stat=gwet glmm, weight=sqrt, options=nosummary)

The smaller contributions to agreement reduce AC_{2} to 0.586, but κ_{ma} remains essentially unchanged at 0.355.

**Weight matrix for Gwet's AC_{2} and/or GLMM Kappa_{ma}**

| w1 | w2 | w3 | w4 | w5 |
|---|---|---|---|---|
| 1.00000 | 0.50000 | 0.29289 | 0.13397 | 0.00000 |
| 0.50000 | 1.00000 | 0.50000 | 0.29289 | 0.13397 |
| 0.29289 | 0.50000 | 1.00000 | 0.50000 | 0.29289 |
| 0.13397 | 0.29289 | 0.50000 | 1.00000 | 0.50000 |
| 0.00000 | 0.13397 | 0.29289 | 0.50000 | 1.00000 |

**Gwet's Agreement Coefficient**

| Fixed | AC2 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Raters | 0.58567 | 0.048946 | 11.9657 | <.0001 | 0.48974 | 0.68160 |
| Items | 0.58567 | 0.078569 | 7.4542 | <.0001 | 0.43168 | 0.73966 |
|  | 0.58567 | 0.092568 | 6.3269 | <.0001 | 0.40424 | 0.76710 |

**Weighted GLMM-based ordinal measure Kappa_{ma}**

| Kappa_ma | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.35517 | 0.068474 | 0.22097 | 0.48938 | 0.52942 | 0.091248 |

More control, in the sense of more rapidly dropping weights at larger differences, can be obtained by also specifying **wtparm=**. This can be used along with any of the **weight=** values (except when **weight=***data-set-name*). The weights involve quantiles from an exponential distribution with parameter equal to the **wtparm=** value. Typically, suitable results can be obtained with values between 0.01 and 2. A parameter of 1 is used along with the linear matrix in this next call to the macro.

%magree(data=a, items=s, raters=r, response=y, stat=gwet glmm, weight=linear, wtparm=1, options=nosummary)
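One construction consistent with the matrix displayed below takes, for each normalized difference d/(k−1), the exponential quantile with rate equal to the **wtparm=** value, subtracts it from 1, and floors negative values at zero. This is a reverse-engineered reading of the displayed values, not the macro's documented formula; consult the Details section for the actual definition:

```python
import math

def exp_quantile_weights(k, rate):
    """Hypothetical reconstruction: w(d) = max(0, 1 - Q(d/(k-1))),
    where Q(p) = -ln(1 - p)/rate is the exponential quantile function."""
    def w(d):
        p = d / (k - 1)
        if p >= 1:
            return 0.0                      # Q(1) is infinite
        return max(0.0, 1 - (-math.log(1 - p) / rate))
    return [[w(abs(i - j)) for j in range(k)] for i in range(k)]

for row in exp_quantile_weights(5, 1):
    print([round(v, 5) for v in row])
# First row: 1.0, 0.71232, 0.30685, 0.0, 0.0 -- matching the wtparm=1 matrix
```

A larger rate keeps the off-diagonal weights closer to the unmodified values; a small rate (toward 0.01) drives them to zero quickly, which matches the "more rapidly dropping weights" behavior described above.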

Compared to the linear matrix alone, the weights diminish more quickly, becoming zero for the second largest difference in addition to the largest. The AC_{2} estimate is 0.614, which is slightly less than with the linear matrix alone. κ_{ma} remains unchanged at 0.355.

**Weight matrix for Gwet's AC_{2} and/or GLMM Kappa_{ma}**

| w1 | w2 | w3 | w4 | w5 |
|---|---|---|---|---|
| 1.00000 | 0.71232 | 0.30685 | 0.00000 | 0.00000 |
| 0.71232 | 1.00000 | 0.71232 | 0.30685 | 0.00000 |
| 0.30685 | 0.71232 | 1.00000 | 0.71232 | 0.30685 |
| 0.00000 | 0.30685 | 0.71232 | 1.00000 | 0.71232 |
| 0.00000 | 0.00000 | 0.30685 | 0.71232 | 1.00000 |

**Gwet's Agreement Coefficient**

| Fixed | AC2 | Standard Error | Z Value | Pr > \|Z\| | Lower Confidence Limit | Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Raters | 0.61407 | 0.049157 | 12.4920 | <.0001 | 0.51772 | 0.71041 |
| Items | 0.61407 | 0.086580 | 7.0924 | <.0001 | 0.44437 | 0.78376 |
|  | 0.61407 | 0.099562 | 6.1677 | <.0001 | 0.41893 | 0.80920 |

**Weighted GLMM-based ordinal measure Kappa_{ma}**

| Kappa_ma | Standard Error | Lower Confidence Limit | Upper Confidence Limit | Rho | Rho Standard Error |
|---|---|---|---|---|---|
| 0.35517 | 0.068474 | 0.22097 | 0.48938 | 0.52942 | 0.091248 |

Note that if the **percategory** option were used in the above, the category-level Gwet AC_{2} results would not change with different weights since those results are based on a dichotomized response. With a two-level response, the ordinality is lost and so weights are not used.

Right-click on the macro download link and select **Save** to save the MAGREE macro definition to a file. It is recommended that you name the file `magree.sas`.

Estimate and test agreement among multiple raters when ratings are nominal or ordinal. For nominal responses, kappa and Gwet's AC1 agreement coefficient are available. For ordinal responses, Gwet's weighted AC2, Kendall's coefficient of concordance, and GLMM-based statistics are available.

#### Operating System and Release Information

Type: Sample

Topic: SAS Reference ==> Procedures ==> FREQ; Analytics ==> Nonparametric Analysis; SAS Reference ==> Macro; Analytics ==> Descriptive Statistics; Analytics ==> Categorical Data Analysis

Date Modified: 2019-09-20 16:18:19

Date Created: 2005-01-13 15:03:24

| Product Family | Product | Host | SAS Release Starting | SAS Release Ending |
|---|---|---|---|---|
| SAS System | SAS/STAT | All | n/a | n/a |