Inter-rater Reliability
Inter-rater reliability is the degree of agreement between different raters scoring the same performance. When two examiners read the same essay or listen to the same speaking performance, do they arrive at the same score? If not, the scores reflect rater differences rather than test-taker ability — and the test results are unreliable.
Inter-rater reliability is the critical reliability concern for subjectively scored tasks: writing, speaking, and any performance-based assessment. Objectively scored items (multiple choice, gap-fill) do not have inter-rater reliability issues because the scoring key determines the score.
Measurement
Cohen's Kappa (κ)
Measures agreement between two raters corrected for the agreement expected by chance. Values range from -1 to +1, where 0 indicates chance-level agreement:
| Kappa value | Interpretation |
|---|---|
| ≤ 0.20 | Poor agreement |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Near-perfect |
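The statistic itself is observed agreement corrected for the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below computes it by hand for two raters; the band scores are invented for illustration, and in practice a library routine such as scikit-learn's cohen_kappa_score (which also offers a weighted kappa suited to ordinal band scales) would normally be used.

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa for two raters scoring the same performances.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
    of exact agreement and p_e is the agreement expected by chance given
    each rater's marginal score distribution.
    """
    n = len(scores_a)
    # Observed agreement: proportion of performances given identical scores.
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement: product of the two marginal proportions for each
    # score category, summed over all categories either rater used.
    dist_a, dist_b = Counter(scores_a), Counter(scores_b)
    p_e = sum((dist_a[c] / n) * (dist_b[c] / n) for c in dist_a | dist_b)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical band scores from two raters on ten essays.
rater_a = [5, 6, 6, 7, 5, 6, 7, 8, 6, 5]
rater_b = [5, 6, 5, 7, 5, 6, 7, 7, 6, 6]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.57: "moderate" on the table above
```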
Pearson Correlation
Measures the linear relationship between two sets of scores. High correlation (r > .80) suggests raters rank test-takers similarly, but does not guarantee they assign the same scores — one rater could be consistently harsher than another (a systematic bias that correlation misses).
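A small sketch of that caveat: if one rater is exactly one band harsher on every performance, the correlation is perfect even though the raters never assign the same score. The scores are invented, and statistics.correlation requires Python 3.10 or later.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores: rater B ranks candidates exactly like rater A
# but is one band harsher on every performance.
rater_a = [5, 6, 6, 7, 8, 6, 7, 5]
rater_b = [s - 1 for s in rater_a]

r = correlation(rater_a, rater_b)
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(r, exact)  # 1.0 0.0: perfect correlation, zero exact agreement
```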
Exact Agreement and Adjacent Agreement
Percentage of scores where raters assign the same score (exact) or within one band (adjacent). IELTS targets exact + adjacent agreement above 90% for its writing and speaking examiners.
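A minimal sketch of both rates, using invented band scores; the combined figure is the kind of statistic that targets such as the 90% mentioned above refer to.

```python
def agreement_rates(scores_a, scores_b):
    """Exact and adjacent (within one band) agreement between two raters."""
    n = len(scores_a)
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) == 1 for a, b in zip(scores_a, scores_b)) / n
    return exact, adjacent

# Hypothetical band scores on nine speaking performances.
rater_a = [6, 7, 5, 6, 8, 7, 6, 5, 6]
rater_b = [6, 6, 5, 7, 8, 7, 4, 5, 6]

exact, adjacent = agreement_rates(rater_a, rater_b)
print(f"exact {exact:.0%}, adjacent {adjacent:.0%}, combined {exact + adjacent:.0%}")
# exact 67%, adjacent 22%, combined 89%
```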
Many-Facet Rasch Measurement (MFRM)
Linacre's (1989) extension of Rasch measurement that models test-taker ability, item difficulty, and rater severity simultaneously. Used in large-scale testing to identify rater effects and adjust scores accordingly. This is the gold standard for inter-rater reliability analysis in high-stakes language testing.
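As a sketch of what the model estimates, the basic MFRM specification with a single rater facet is commonly written as follows (notation follows standard presentations of Linacre's model):

```latex
% Basic MFRM model with one rater facet (after Linacre, 1989):
%   B_n = ability of test-taker n
%   D_i = difficulty of task/item i
%   C_j = severity of rater j
%   F_k = difficulty of the step from score category k-1 to k
\log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
```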
Intra-rater vs Inter-rater Reliability
| Type | Question | Example |
|---|---|---|
| Inter-rater | Do different raters agree? | Rater A gives 6, Rater B gives 4 |
| Intra-rater | Does the same rater agree with themselves over time? | Rater A gives 6 on Monday, 5 on Friday |
Both matter. A rater who is internally consistent (high intra-rater) but systematically harsh (low inter-rater) can be calibrated. A rater who is inconsistent even with themselves is a deeper problem.
Sources of Rater Disagreement
Severity/leniency. Some raters are consistently stricter or more generous than others. This is the most common and most manageable rater effect.
Central tendency. Raters avoid extreme scores, clustering everything around the middle of the scale. This compresses the score distribution and reduces discrimination between strong and weak performances.
Halo effect. A strong (or weak) impression in one area biases scoring in other areas. A beautifully handwritten essay may receive inflated scores for content; a speaker with a strong accent may receive depressed scores for grammar.
Rater fatigue. Scoring quality degrades over long marking sessions. Scores assigned at the end of a batch are often less reliable than those at the beginning.
Construct interpretation. Raters may understand the scoring criteria differently. What counts as "a range of vocabulary" or "adequate coherence" can vary between raters without explicit calibration.
Improving Inter-rater Reliability
Clear rubrics with descriptors. Detailed scoring guides with level-specific descriptors, concrete examples, and boundary definitions reduce ambiguity. Vague criteria ("good grammar") produce disagreement; specific criteria ("uses a range of complex structures with occasional errors that do not impede communication") constrain interpretation.
Standardization sessions. Before marking begins, raters score benchmark samples together, discuss disagreements, and align their interpretation of the scale. Cambridge Assessment and IDP run regular standardization sessions for IELTS examiners.
Benchmark scripts. Reference performances at each band level that raters can consult during marking. These anchor the scale to concrete examples.
Double marking. Two raters independently score each performance; the final score is the average of the two, or a third rater adjudicates if the scores are discrepant. Essential for high-stakes decisions.
Rater training. Ongoing training, not just initial certification. Rater accuracy drifts over time without recalibration.
Statistical monitoring. Track individual rater statistics (mean severity, consistency, fit to the scoring model) and flag raters who deviate. MFRM-based monitoring is standard in large-scale testing.
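A crude version of this monitoring is possible without MFRM. The sketch below flags any rater whose average awarded score drifts from the group average by more than an assumed tolerance (half a band here); it assumes the raters have marked comparable batches of scripts, and the names and scores are invented.

```python
from statistics import mean

def flag_drifting_raters(scores_by_rater, tolerance=0.5):
    """Flag raters whose mean awarded score drifts from the group mean.

    A rough severity check, not MFRM: 'tolerance' is an assumed limit
    (half a band) on how far a rater's average may sit from the average
    across all raters before being flagged for recalibration.
    """
    all_scores = [s for scores in scores_by_rater.values() for s in scores]
    overall = mean(all_scores)
    return [
        (rater, round(mean(scores) - overall, 2))
        for rater, scores in scores_by_rater.items()
        if abs(mean(scores) - overall) > tolerance
    ]

# Hypothetical marking data: rater C looks systematically harsh.
scores = {
    "A": [6, 7, 6, 6, 7, 6],
    "B": [6, 6, 7, 7, 6, 6],
    "C": [4, 5, 4, 5, 4, 5],
    "D": [6, 6, 6, 7, 6, 7],
}
print(flag_drifting_raters(scores))  # [('C', -1.38)]
```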
Why It Matters
Every teacher who marks a writing task or scores a speaking performance is a rater. Inter-rater reliability concerns apply not just to large-scale testing but to everyday classroom assessment.
At EH, with multiple teachers scoring writing and speaking across the IELTS program:
- Different teachers marking the same essay should give approximately the same score
- Regular standardization sessions using benchmark essays are essential
- Shared rubrics with detailed descriptors ensure everyone is interpreting the criteria the same way
- When scores determine whether a student progresses to the next level, inter-rater reliability is a fairness issue, not just a statistical one
Key References
- Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
- McNamara, T. (1996). Measuring Second Language Performance. Longman.
- Lumley, T. (2005). Assessing Second Language Writing: The Rater's Perspective. Peter Lang.
- Linacre, J. M. (1989). Many-Facet Rasch Measurement. MESA Press.
- Weigle, S. C. (2002). Assessing Writing. Cambridge University Press.
- Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.