Rater Training
Rater training is the systematic process of calibrating examiners to apply rating scales and band descriptors consistently. Without it, subjective assessment degenerates into personal opinion — different raters apply different standards, and scores become unreliable.
The goal is not to eliminate human judgment (that is what makes performance assessment valuable) but to align it: raters should agree not because they are robots, but because they share a common understanding of what each band level looks like.
The Problem Rater Training Solves
Untrained raters exhibit predictable biases:
| Bias | Description |
|---|---|
| Severity/leniency | Some raters consistently score higher or lower than others |
| Central tendency | Avoiding extreme scores, clustering marks in the middle bands |
| Halo effect | One strong feature (e.g., fluency) inflating scores across all criteria |
| First impression | The opening of a performance disproportionately influencing the overall score |
| Fatigue | Scores drifting as raters tire over a long marking session |
| L1 bias | Rating candidates from familiar L1 backgrounds differently |
These biases directly undermine inter-rater reliability and, by extension, the fairness and validity of the assessment.
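As a rough illustration of how the first two biases surface in score data, the sketch below (hypothetical raters and band scores on a 9-band scale) shows severity/leniency as a shifted mean relative to the rater pool and central tendency as a compressed standard deviation.

```python
from statistics import mean, stdev

# Hypothetical band scores (1-9 scale) from three raters on the same scripts.
scores = {
    "rater_a": [5, 6, 7, 4, 8, 6, 5, 7],   # roughly in line with the pool
    "rater_b": [4, 5, 6, 3, 7, 5, 4, 6],   # severity: consistently about one band lower
    "rater_c": [5, 6, 6, 5, 6, 6, 5, 6],   # central tendency: scores bunched mid-scale
}

pool_mean = mean(s for ratings in scores.values() for s in ratings)

for rater, ratings in scores.items():
    offset = mean(ratings) - pool_mean   # negative = severe, positive = lenient
    spread = stdev(ratings)              # unusually low spread suggests central tendency
    print(f"{rater}: mean offset {offset:+.2f}, SD {spread:.2f}")
```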
Components of Rater Training
1. Orientation
Raters study the rubric, band descriptors, and test specifications. The focus is on understanding what each criterion means and how the bands differ from one another. This is not a passive reading exercise — it requires active discussion and clarification.
2. Benchmarking
Raters score a set of benchmark samples — performances that have been pre-rated by expert panels and represent each band level. After scoring independently, raters compare their scores with the benchmark and discuss discrepancies. This is where calibration happens: raters adjust their internal standards to align with the agreed benchmarks.
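A minimal sketch of the comparison step, assuming hypothetical sample IDs, benchmark scores, and one rater's independent scores; any mismatch is surfaced so the discussion can return to the descriptors.

```python
# Benchmark scores agreed by an expert panel (hypothetical samples and values).
benchmark = {"sample_1": 5, "sample_2": 7, "sample_3": 4, "sample_4": 6}

# One rater's independent scores on the same samples.
rater_scores = {"sample_1": 5, "sample_2": 6, "sample_3": 5, "sample_4": 6}

# Flag every sample where the rater deviates from the benchmark,
# so each discrepancy can be discussed against the band descriptors.
for sample, agreed in benchmark.items():
    given = rater_scores[sample]
    if given != agreed:
        print(f"{sample}: rater gave {given}, benchmark is {agreed} (diff {given - agreed:+d})")
```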
3. Practice Rating
Raters score additional samples independently, receive feedback, and discuss borderline cases. Multiple rounds may be needed. The emphasis is on applying the descriptors as written, not on personal preferences about what "good" writing or speaking sounds like.
4. Qualification
In formal testing programmes (IELTS, Cambridge), raters must pass a standardisation test — scoring a set of performances within acceptable tolerance of the benchmark scores. Raters who fall outside tolerance receive additional training or are not certified.
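As a sketch of what a tolerance check might look like, the function below treats a rater as qualifying when a sufficient proportion of their scores fall within one band of the benchmark; the ±1-band tolerance and 80% threshold are illustrative assumptions, not the actual IELTS or Cambridge criteria.

```python
def qualifies(rater_scores, benchmark_scores, tolerance=1, required_rate=0.8):
    """Return True if enough of the rater's scores fall within
    `tolerance` bands of the benchmark. Thresholds are illustrative."""
    within = sum(
        1 for given, agreed in zip(rater_scores, benchmark_scores)
        if abs(given - agreed) <= tolerance
    )
    return within / len(benchmark_scores) >= required_rate

# Example: 9 of 10 scores within one band of the benchmark, so the rater qualifies.
print(qualifies([5, 6, 7, 4, 8, 6, 5, 7, 6, 3],
                [5, 7, 7, 4, 7, 6, 5, 6, 6, 5]))
```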
5. Ongoing Monitoring
Training is not a one-off event. Rater drift occurs over time as personal standards gradually shift. Monitoring mechanisms include:
- Regular re-standardisation sessions
- Statistical analysis of each rater's scoring patterns (mean, SD, inter-rater agreement; see the sketch after this list)
- Double-marking a proportion of scripts
- Feedback on flagged scores that deviate significantly from expected patterns
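A minimal monitoring sketch, assuming hypothetical scores from two raters on the same set of double-marked scripts: it reports each rater's mean and SD, an exact-plus-adjacent agreement rate, and the scripts whose scores diverge by more than one band and would be flagged for review.

```python
from statistics import mean, stdev

# Hypothetical band scores from two raters on the same double-marked scripts
# (the same index refers to the same script).
rater_1 = [5, 6, 7, 4, 8, 6, 5, 7, 6, 5]
rater_2 = [5, 7, 7, 5, 6, 6, 4, 6, 6, 6]

for name, scores in (("rater_1", rater_1), ("rater_2", rater_2)):
    print(f"{name}: mean {mean(scores):.2f}, SD {stdev(scores):.2f}")

# Exact or adjacent (within one band) agreement across the double-marked scripts.
pairs = list(zip(rater_1, rater_2))
agreement = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
print(f"exact/adjacent agreement: {agreement:.0%}")

# Scripts where the two scores diverge by more than one band are flagged
# for third marking or rater feedback.
flagged = [i for i, (a, b) in enumerate(pairs) if abs(a - b) > 1]
print("scripts flagged for review:", flagged)
```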
Standardisation Meetings
The standardisation meeting is the core event in rater training. A typical session:
- Review the rating criteria and any updates
- Score 3–5 benchmark samples independently
- Compare scores and discuss reasons for agreement/disagreement
- Score additional samples and check convergence
- Address specific problem areas (e.g., how to score a Band 5/6 borderline performance)
These meetings work best with an experienced lead who keeps the discussion anchored to the descriptors rather than letting it drift into personal preference.
In Institutional Contexts
Language schools and universities that develop their own writing or speaking assessments need rater training even with small teams. The minimum:
- A shared rubric with clear band descriptors
- A set of anchor scripts at each level (collected and agreed upon by the team)
- At least one calibration session before each assessment period
- A sample of double-marked scripts to check agreement
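For that last double-marking check, a minimal sketch (hypothetical markers and scores) that reports unweighted Cohen's kappa on exact band agreement; weighted variants that give partial credit to adjacent bands are also common, but this version keeps the arithmetic visible.

```python
from collections import Counter

def cohens_kappa(first, second):
    """Unweighted Cohen's kappa for two markers' band scores on the same scripts."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    freq_1 = Counter(first)
    freq_2 = Counter(second)
    # Agreement expected by chance, from each marker's marginal band frequencies.
    expected = sum(freq_1[band] * freq_2[band] for band in freq_1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical double-marked scripts from a small institutional team.
marker_a = [5, 6, 6, 4, 7, 5, 6, 5]
marker_b = [5, 6, 5, 4, 7, 6, 6, 5]
print(f"kappa = {cohens_kappa(marker_a, marker_b):.2f}")
```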
See Also
- Inter-rater Reliability — the outcome that rater training aims to achieve
- Rating Scale — the instrument raters are trained to apply
- Band Descriptors — the descriptions raters must internalise
- Moderation — the quality assurance process that checks rater consistency
- Interlocutor Frame — standardised prompts that complement rater training in speaking tests