- Example 1: Estimating and testing rater agreement
-
This example is from Fleiss et. al. (2003, Table 18.8). Ten subjects
(S) are each rated into one of three categories (Y) by each of five
raters (R). The following DATA step creates a data set with one observation per rating. The response, Y, is numeric allowing it to be treated as nominal or ordinal so that all statistics – Fleiss' kappa, Gwet's agreement coefficient, the GLMM-based statistics, and Kendall's coefficient of concordance – can be computed.
title "Analysis of data from Fleiss (2003)";
data ratings;
do s=1 to 10;
do r=1 to 5;
input y @@;
output;
end; end;
datalines;
1 2 2 2 2
1 1 3 3 3
3 3 3 3 3
1 1 1 1 3
1 1 1 3 3
1 2 2 2 2
1 1 1 1 1
2 2 2 2 3
1 3 3 3 3
1 1 1 3 3
;
The following displays the data for the first two subjects showing the form of the data needed by the MAGREE macro with one observation per rating.
proc print data=ratings;
where s in (1,2);
id s r;
title2 'Required form of input data set';
run;
Nominal response
The following statements use the %inc statement to define the MAGREE macro in your SAS session. The macro needs to be defined only once per SAS session. The macro is then called to compute the agreement statistics. stat=nominal is specified to compute all statistics appropriate for a nominal response which includes the kappa statistic and Gwet's AC1. The counts option is included to display the table showing, for each subject, the number of ratings in each response category.
%inc "<location of your file containing the MAGREE macro>";
%magree(data=ratings,
items=s, raters=r, response=y,
stat=nominal, options=counts)
The "Ratings Summary" table shows the number of subjects that have specific numbers of ratings. This table details any departure from balance. Balance occurs when all subjects have equal numbers of ratings. These data are balanced with all ten subjects rated once by all raters so that each subject has five ratings. The "Rating counts per item" table reproduces the table shown in Fleiss et. al. (2003).
Following these data summary tables, the table of Fleiss' kappa statistics appears. The overall kappa estimate, 0.418, is significantly different from zero (p<0.0001). A 95% large-sample confidence interval for the overall kappa is (0.214, 0.621).
0.41789 |
0.071653 |
5.83220 |
<.0001 |
0.10383 |
0.21439 |
0.62139 |
|
The table of Gwet agreement statistics is given next. Since the response is considered nominal, no weight matrix was specified. As a result, the AC1 statistic is produced. The AC1 estimate is 0.436. If the raters are considered fixed, so that inference is limited to the observed set of raters but subjects are considered randomly sampled from an infinite population, then AC1 is significantly different from zero (p<0.0001) with a 95% confidence interval of (0.230, 0.642). Alternatively, if the raters are considered randomly sampled from an infinite population and subjects are considered fixed, then AC1 is significantly different from zero (p=0.0280) with confidence interval (0.047, 0.825). If both raters and subjects are samples from infinite populations, then AC1 is marginally significant (p=0.0522) with confidence interval is (-0.004, 0.876). Use the results appropriate for your study design.
0.43587 |
0.10511 |
4.14687 |
<.0001 |
0.22986 |
0.64187 |
0.43587 |
0.19836 |
2.19732 |
0.0280 |
0.04708 |
0.82465 |
0.43587 |
0.22449 |
1.94159 |
0.0522 |
-0.00412 |
0.87586 |
|
Ordinal response
If the response is considered ordinal then Gwet's AC2, the GLMM-based statistics κm and κma, and Kendall's coefficient of concordance, W, can be used which take the ordering into account when assessing the agreement among the raters. AC2 and W also use the spacing of the categories. Fleiss' kappa and/or Gwet's AC1 statistic could also be used, but they do not take the ordinal nature of the response into account, effectively treating them as nominal.
In the following macro calls, stat=ordinal is specified to compute all statistics appropriate for an ordinal response. The nosummary option is specified and the counts option is removed to avoid repeating those tables shown above. Kendall's W and κm for ordinal data are provided when weight= is not specified since they do not make use of weights. Also, the wcl option is included to provide a jackknife-based estimate of the standard deviation of Kendall's W and a 95% confidence interval. Specify alpha= if a different confidence level is desired.
%magree(data=ratings,
items=s, raters=r, response=y,
stat=ordinal, options=nosummary wcl)
Assuming an ordinal response, the GLMM-based κm estimate of exact agreement is 0.214 with a 95% confidence interval of (0.063, 0.365).
The estimate of Kendall's W is 0.491 with 95% confidence interval (0.191, 0.790). Gwet's AC2 cannot be computed since the required weight matrix is
not specified. That is addressed next.
0.21408 |
0.077221 |
0.062731 |
0.36543 |
0.45979 |
0.13653 |
3.852 |
8.6 |
34.4 |
0.0021 |
0.15299 |
0.19073 |
0.79044 |
|
To compute Gwet's AC2 and κma, a weight matrix is specified to allow for partial agreement to contribute. A linear weight matrix is specified with weight=linear which prevents Kendall's W and κm from being computed.
%magree(data=ratings,
items=s, raters=r, response=y,
stat=ordinal, weight=linear, options=nosummary)
A table of Gwet's AC2 statistics now appears following display of the linear weight matrix that was requested. The values in this matrix indicate the amount of partial agreement that is considered to exist for each possible disagreement in rating. The 0.5 value indicates partial agreement for a difference of one rating level. However, there is no partial agreement for a difference of two levels.
Using these weights, the AC2 estimate is 0.298. As for the AC1 results above, a test and confidence interval is provided depending on whether raters, subjects, or neither are considered fixed. The table showing the GLMM-based κma is displayed next. Its estimated value is 0.304 with confidence interval (0.112, 0.496). The value of ρ is a moderate 0.460 suggesting the neither source of variability (raters or items) is dominant.
1.0 |
0.5 |
0.0 |
0.5 |
1.0 |
0.5 |
0.0 |
0.5 |
1.0 |
0.29825 |
0.15287 |
1.95096 |
0.0511 |
-0.00138 |
0.59787 |
0.29825 |
0.21150 |
1.41013 |
0.1585 |
-0.11629 |
0.71278 |
0.29825 |
0.26096 |
1.14286 |
0.2531 |
-0.21324 |
0.80973 |
0.30415 |
0.097874 |
0.11232 |
0.49598 |
0.45979 |
0.13653 |
|
- Example 2: Assessing individual rater and item effects
- The κm and κma statistics are based on a mixed model which includes random effects for the raters and for the items. Estimates of these random effects can be used to further investigate the effects of individual raters and/or items on departures from agreement.
Assuming that the response in the above data is ordinal, the following macro call requests the κm statistic and both the ratereffects and the itemeffects options to display tables of the individual effects. Note that these effects are the same whether they are then used to compute κm for exact agreement or κma for partial agreement (association).
%magree(data=ratings, items=s, raters=r, response=y,
stat=glmm, options=ratereffects itemeffects)
The GLMM-based measure of exact agreement is κm=0.214 with confidence interval (0.063, 0.365) suggesting a small amount of exact agreement.
Contrast this with the slightly stronger measure of partial agreement (κma=0.304) using linear weights shown above. The value of rho, 0.46,
indicates that the variability among items is about the same as the variability among raters.
Starting with the final table of rater effects, note that raters 1 and 5 are the most divergent from the others though neither differs significantly from zero.
The negative value for rater 1 suggests that rater tends to rate items more toward the low end of the scale, while the positive value for rater 5 suggests
that rater tends more toward the higher end. Similarly in the item effects table, the effect of subject 3 is significantly positive (p=0.047)
suggesting that subject tends toward the high end of the scale. Subject 7 however has a marginally significant negative effect (p=0.057) indicating a
lower position on the scale.
0.21408 |
0.077221 |
0.062731 |
0.36543 |
0.45979 |
0.13653 |
-0.2599 |
0.6218 |
35 |
-0.42 |
0.6785 |
0.05 |
-1.5222 |
1.0024 |
0.4084 |
0.6638 |
35 |
0.62 |
0.5424 |
0.05 |
-0.9391 |
1.7559 |
2.0934 |
1.0165 |
35 |
2.06 |
0.0470 |
0.05 |
0.02982 |
4.1570 |
-0.9941 |
0.7248 |
35 |
-1.37 |
0.1789 |
0.05 |
-2.4656 |
0.4773 |
-0.2664 |
0.6619 |
35 |
-0.40 |
0.6898 |
0.05 |
-1.6101 |
1.0773 |
-0.2599 |
0.6218 |
35 |
-0.42 |
0.6785 |
0.05 |
-1.5222 |
1.0024 |
-1.9624 |
0.9962 |
35 |
-1.97 |
0.0568 |
0.05 |
-3.9849 |
0.06007 |
0.4058 |
0.6241 |
35 |
0.65 |
0.5198 |
0.05 |
-0.8612 |
1.6728 |
1.1363 |
0.7377 |
35 |
1.54 |
0.1325 |
0.05 |
-0.3612 |
2.6338 |
-0.2664 |
0.6619 |
35 |
-0.40 |
0.6898 |
0.05 |
-1.6101 |
1.0773 |
-1.2621 |
0.6707 |
35 |
-1.88 |
0.0682 |
0.05 |
-2.6238 |
0.09953 |
-0.4039 |
0.5730 |
35 |
-0.70 |
0.4855 |
0.05 |
-1.5671 |
0.7592 |
-0.07079 |
0.5651 |
35 |
-0.13 |
0.9010 |
0.05 |
-1.2180 |
1.0765 |
0.5890 |
0.5826 |
35 |
1.01 |
0.3189 |
0.05 |
-0.5937 |
1.7717 |
1.1684 |
0.6423 |
35 |
1.82 |
0.0775 |
0.05 |
-0.1356 |
2.4724 |
|
- Example 3: Per category agreement
- Agreement of the raters on each response category can be assessed by specifying the percategory option. The following macro call provides kappa and AC1 estimates for each category as well as overall. The Kendall and GLMM-based results (if requested specifically or with stat=all) are not affected by this option. The estimates for a category can be obtained by dichotomizing the response such that all categories other than the one in question are combined.
%magree(data=ratings,
items=s, raters=r, response=y,
stat=nominal, options=percategory)
The statistics for the individual response categories indicate the degree of agreement on each category. In this example, agreement is strongest on category 2.
0.29167 |
0.10000 |
2.91667 |
0.0018 |
0.15546 |
-0.01303 |
0.59636 |
0.67105 |
0.10000 |
6.71053 |
<.0001 |
0.05018 |
0.57271 |
0.76940 |
0.34896 |
0.10000 |
3.48958 |
0.0002 |
0.17249 |
0.01089 |
0.68703 |
0.41789 |
0.07165 |
5.83220 |
<.0001 |
0.10383 |
0.21439 |
0.62139 |
0.55263 |
0.14650 |
3.77224 |
0.0002 |
0.26550 |
0.83977 |
0.55263 |
0.16593 |
3.33041 |
0.0009 |
0.22741 |
0.87786 |
0.55263 |
0.22135 |
2.49663 |
0.0125 |
0.11879 |
0.98647 |
0.85323 |
0.09996 |
8.53577 |
<.0001 |
0.65731 |
1.00000 |
0.85323 |
0.09518 |
8.96395 |
<.0001 |
0.66667 |
1.00000 |
0.85323 |
0.13803 |
6.18153 |
<.0001 |
0.58270 |
1.00000 |
0.61019 |
0.14624 |
4.17242 |
<.0001 |
0.32356 |
0.89682 |
0.61019 |
0.13142 |
4.64289 |
<.0001 |
0.35260 |
0.86777 |
0.61019 |
0.19662 |
3.10339 |
0.0019 |
0.22482 |
0.99555 |
0.43587 |
0.10511 |
4.14687 |
<.0001 |
0.22986 |
0.64187 |
0.43587 |
0.19836 |
2.19732 |
0.0280 |
0.04708 |
0.82465 |
0.43587 |
0.22449 |
1.94159 |
0.0522 |
-0.00412 |
0.87586 |
|
- Example 4: Unequal numbers of ratings per item
-
When some raters do not rate some items, not all statistics can be computed. If any item has only a single rating then the kappa statistics cannot be produced. However, as long as every item is rated by two or more raters, then kappa estimates and confidence intervals are provided, but tests are only available if all items are rated by all raters. Kappa statistics are also not available if some rater rates some item more than once.
Gwet statistics also are not available if some rater rates some item more than once. But estimates, tests, and confidence intervals are available even if some raters don't rate some items. Note that tests and confidence intervals for the fixed-items and unconditional study designs require that there be at least three raters in the study data.
GLMM-based statistics are available if some raters do not rate some items, but not if any rater rates an item more than once.
Kendall's statistic cannot be provided if some raters do not rate some items, or if any rater rates an item more than once.
The following modification of Example 1 illustrates the results when some raters do not rate some items.
title "Analysis of Fleiss (2003) data with missings";
data missings;
do S=1 to 10;
do R=1 to 5;
input Y @@;
output;
end; end;
datalines;
1 2 2 2 .
1 1 3 3 3
. 3 3 3 3
1 1 1 1 3
1 1 1 3 3
1 2 . . .
1 1 1 1 1
2 2 . . 3
1 3 3 3 3
1 1 1 3 3
;
%magree(data=missings, items=s, raters=r, response=y, stat=all)
These messages appear in the log indicating that the missing ratings prevent the Kendall statistics and tests for the kappa statistics from being computed.
NOTE: Tests for kappa are only available when all items have an equal
number of ratings.
NOTE: To compute the Kendall statistic, each rater must rate each
item exactly once.
The multiple rows in the "Ratings Summary" table immediately indicates that some raters did not rate some items. In particular, only six items were rated by all five raters. One item had only one rating and another item had only three. Two items had four ratings.
The results show the kappa statistics with confidence intervals followed by the Gwet statistics and the GLMM-based statistics. Note that tests for the kappa statistics are not provided. Also as noted above, the Kendall statistics could not be computed due to the missing ratings.
0.24894 |
0.12985 |
-.005554655 |
0.50344 |
0.30176 |
0.15076 |
2.00154 |
0.0453 |
0.00627 |
0.59725 |
0.30176 |
0.20061 |
1.50424 |
0.1325 |
-0.09142 |
0.69494 |
0.30176 |
0.25094 |
1.20250 |
0.2292 |
-0.19008 |
0.79360 |
0.18171 |
0.079854 |
0.025201 |
0.33822 |
0.40067 |
0.15051 |
|
- Example 5: Ordinal data – Weight matrices for Gwet's AC2 and κma
- When the response has ordered categories, any disagreement in the ratings from two raters can be small or large as measured by the difference in the category values. You might consider small differences to represent partial agreement and increasingly larger differences to represent diminishing amounts of agreement with very large differences representing no agreement. This can be done by defining a weight matrix which is a symmetric, k×k matrix of values ranging from 1 to 0, where k is the number of response categories. Each element in the matrix essentially quantifies the amount of partial agreement that a pair of ratings has. Consequently, the diagonal elements where both ratings are the same always equal 1. Elements off of but near the diagonal typically are assigned larger values since these elements represent smaller differences and more agreement in ratings than elements far from the diagonal. Often, the extreme corners of the matrix are assigned zero weights.
You can specify weight= to define a weight matrix for use in computing the AC2 and κma statistics for ordinal responses. Optionally, wtparm= can be specified in addition to weight=. The Usage and Details sections of this documentation describe how the weights are computed based on the specified values. Alternatively, you can specify your own weight matrix by saving it in a data set and specifying the name of that data set in weight=.
The following illustrates some of the weight matrices that can be generated and the effect on the AC2 and κma statistics. The following macro call computes the unweighted AC1 and κm statistics for a data set of 15 items and 10 raters using a 5-point ordinal response with categories valued 1, 2, 3, 4, or 5.
%magree(data=a,
items=s, raters=r, response=y,
stat=gwet glmm, options=counts nosummary)
There is noticeable agreement shown in the rating counts table. For example, nine of the ten raters agreed that item one belongs in category 2. For item 2, eight of ten agree on category 4, while two others choose categories 2 and 3. The resulting unweighted AC1 agreement coefficient estimate is 0.536 indicating a moderate level of exact agreement and is significantly different from zero (p<0.0001). The GLMM-based κm finds a weaker degree of exact agreement, estimated at 0.164 with confidence interval (0.085, 0.242).
The FREQ Procedure
0.53638 |
0.056783 |
9.44606 |
<.0001 |
0.42509 |
0.64767 |
0.53638 |
0.071871 |
7.46309 |
<.0001 |
0.39552 |
0.67725 |
0.53638 |
0.091596 |
5.85594 |
<.0001 |
0.35686 |
0.71591 |
0.16362 |
0.040056 |
0.085110 |
0.24213 |
0.52942 |
0.091248 |
|
Suppose it is felt that agreement should be estimated allowing for partial agreement depending on the size of the difference in ratings. One way to do that is with a linear weight matrix which can be requested by weight=linear, or equivalently, weight=1.
%magree(data=a,
items=s, raters=r, response=y,
stat=gwet glmm, weight=linear, options=nosummary)
Notice that the weight values decrease progressively to zero with increasing distance from the diagonal. The even steps in value in bands parallel to the diagonal occur because of the evenly-spaced nature of the response categories in this example. Notice also that allowing for partial agreement has increased the agreement coefficient, AC2, to 0.637, and the GLMM-based κma to 0.355 because of the weighting.
1.00 |
0.75 |
0.50 |
0.25 |
0.00 |
0.75 |
1.00 |
0.75 |
0.50 |
0.25 |
0.50 |
0.75 |
1.00 |
0.75 |
0.50 |
0.25 |
0.50 |
0.75 |
1.00 |
0.75 |
0.00 |
0.25 |
0.50 |
0.75 |
1.00 |
0.63674 |
0.051262 |
12.4213 |
<.0001 |
0.53627 |
0.73721 |
0.63674 |
0.084428 |
7.5418 |
<.0001 |
0.47126 |
0.80221 |
0.63674 |
0.098772 |
6.4466 |
<.0001 |
0.44315 |
0.83033 |
0.35518 |
0.068474 |
0.22097 |
0.48939 |
0.52942 |
0.091248 |
|
Another possible weight matrix is the quadratic matrix which could also be requested by weight=quadratic, or equivalently, weight=2.
%magree(data=a,
items=s, raters=r, response=y,
stat=gwet glmm, weight=quadratic, options=nosummary)
Notice that the larger power (2 vs 1) for this matrix increases the weight values meaning that more agreement is contributed by those differences. As a result, AC2 further increases to 0.727. κma is essentially unchanged from the linear weight results.
1.0000 |
0.9375 |
0.7500 |
0.4375 |
0.0000 |
0.9375 |
1.0000 |
0.9375 |
0.7500 |
0.4375 |
0.7500 |
0.9375 |
1.0000 |
0.9375 |
0.7500 |
0.4375 |
0.7500 |
0.9375 |
1.0000 |
0.9375 |
0.0000 |
0.4375 |
0.7500 |
0.9375 |
1.0000 |
0.72677 |
0.06389 |
11.3756 |
<.0001 |
0.60155 |
0.85198 |
0.72677 |
0.09142 |
7.9500 |
<.0001 |
0.54759 |
0.90594 |
0.72677 |
0.11153 |
6.5164 |
<.0001 |
0.50817 |
0.94536 |
0.35519 |
0.068474 |
0.22098 |
0.48940 |
0.52942 |
0.091248 |
|
The square root matrix, weight=sqrt or weight=0.5, produces smaller weights than the linear and quadratic matrices.
%magree(data=a,
items=s, raters=r, response=y,
stat=gwet glmm, weight=sqrt, options=nosummary)
The smaller contributions to agreement reduce AC2 to 0.586, but κma remains unchanged at 0.355.
1.00000 |
0.50000 |
0.29289 |
0.13397 |
0.00000 |
0.50000 |
1.00000 |
0.50000 |
0.29289 |
0.13397 |
0.29289 |
0.50000 |
1.00000 |
0.50000 |
0.29289 |
0.13397 |
0.29289 |
0.50000 |
1.00000 |
0.50000 |
0.00000 |
0.13397 |
0.29289 |
0.50000 |
1.00000 |
0.58567 |
0.048946 |
11.9657 |
<.0001 |
0.48974 |
0.68160 |
0.58567 |
0.078569 |
7.4542 |
<.0001 |
0.43168 |
0.73966 |
0.58567 |
0.092568 |
6.3269 |
<.0001 |
0.40424 |
0.76710 |
0.35517 |
0.068474 |
0.22097 |
0.48938 |
0.52942 |
0.091248 |
|
More control, in the sense of more rapidly dropping weights at larger differences, can be done by also specifying wtparm=. This can be used along with any of the weight= values (except when weight=data-set-name). This involves quantiles from an exponential distribution with parameter equal to the wtparm= value. Typically, suitable results can be obtained with values between 0.01 and 2. A parameter of 1 is used along with the linear matrix in this next call to the macro.
%magree(data=a,
items=s, raters=r, response=y,
stat=gwet glmm, weight=linear, wtparm=1, options=nosummary)
Compared to the linear matrix alone, the weight diminishes quickly, becoming zero for the second largest difference in addition to the largest. The AC2 estimate is 0.614 which is slightly less than with the linear matrix alone. κma remains unchanged at 0.355.
1.00000 |
0.71232 |
0.30685 |
0.00000 |
0.00000 |
0.71232 |
1.00000 |
0.71232 |
0.30685 |
0.00000 |
0.30685 |
0.71232 |
1.00000 |
0.71232 |
0.30685 |
0.00000 |
0.30685 |
0.71232 |
1.00000 |
0.71232 |
0.00000 |
0.00000 |
0.30685 |
0.71232 |
1.00000 |
0.61407 |
0.049157 |
12.4920 |
<.0001 |
0.51772 |
0.71041 |
0.61407 |
0.086580 |
7.0924 |
<.0001 |
0.44437 |
0.78376 |
0.61407 |
0.099562 |
6.1677 |
<.0001 |
0.41893 |
0.80920 |
0.35517 |
0.068474 |
0.22097 |
0.48938 |
0.52942 |
0.091248 |
|
Note that if the percategory option were used in the above, the category-level Gwet AC2 results would not change with different weights since those results are based on a dichotomized response. With a two-level response, the ordinality is lost and so weights are not used.
- EXAMPLE 6: BY group processing
- While the MAGREE macro does not support BY processing directly, the RunBY macro can be used to run the macro on BY groups
in the data. In the statements below, the data set in Example 1 above is expanded to include data from subjects and raters on two different days.
The RunBY macro is then used to run the MAGREE macro on each day in turn. This is done with a DATA step that includes a subsetting WHERE statement that
specifies the special macro variables, _BYx and _LVLx, which are used by the RunBY macro to process each BY group. The BYlabel macro variable is also used
to label the displayed results with the BY group definition. Since the MAGREE macro writes its own titles, a FOOTNOTE statement is used instead of a TITLE
statement to provide the label.
data ratings;
do day=1,2;
do s=1 to 10;
do r=1 to 5;
input y @@;
output;
end; end; end;
datalines;
1 2 2 2 2
1 1 3 3 3
3 3 3 3 3
1 1 1 1 3
1 1 1 3 3
1 2 2 2 2
1 1 1 1 1
2 2 2 2 3
1 3 3 3 3
1 1 1 3 3
1 2 2 2 2
1 1 3 3 3
3 3 2 3 3
1 1 1 1 3
1 1 2 3 3
1 2 2 2 2
1 1 2 3 1
2 1 2 2 3
1 3 3 3 3
1 1 1 3 3
;
%macro code();
data subset; set ratings; where &_BY1=&_LVL1; run;
footnote "Above for &BYlabel";
%magree(data=subset, items=s, raters=r, response=y, stat=nominal)
footnote;
%mend;
%RunBY(data=ratings, by=day)