Rater Training
Rater training is the systematic process of calibrating examiners to apply rating scales and band descriptors consistently. Without it, subjective assessment degenerates into personal opinion — different raters apply different standards, and scores become unreliable.
The goal is not to eliminate human judgment (that is what makes performance assessment valuable) but to align it: raters should agree not because they are robots, but because they share a common understanding of what each band level looks like.
The Problem Rater Training Solves
Untrained raters exhibit predictable biases:
| Bias | Description |
|---|---|
| Severity/leniency | Some raters consistently score higher or lower than others |
| Central tendency | Avoiding extreme scores, clustering marks in the middle bands |
| Halo effect | One strong feature (e.g., fluency) inflating scores across all criteria |
| First impression | The opening of a performance disproportionately influencing the overall score |
| Fatigue | Scores drifting as raters tire over a long marking session |
| L1 bias | Rating candidates from familiar L1 backgrounds differently |
These biases directly undermine inter-rater reliability and, by extension, the fairness and validity of the assessment.
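As a rough illustration of how the first two biases surface in score data, the sketch below (hypothetical raters and band scores on a 9-band scale) shows severity/leniency as a shifted mean relative to the rater pool and central tendency as a compressed standard deviation.

```python
from statistics import mean, stdev

# Hypothetical band scores (1-9 scale) from three raters on the same scripts.
scores = {
    "rater_a": [5, 6, 7, 4, 8, 6, 5, 7],   # roughly in line with the pool
    "rater_b": [4, 5, 6, 3, 7, 5, 4, 6],   # severity: consistently about one band lower
    "rater_c": [5, 6, 6, 5, 6, 6, 5, 6],   # central tendency: scores bunched mid-scale
}

pool_mean = mean(s for ratings in scores.values() for s in ratings)

for rater, ratings in scores.items():
    offset = mean(ratings) - pool_mean   # negative = severe, positive = lenient
    spread = stdev(ratings)              # unusually low spread suggests central tendency
    print(f"{rater}: mean offset {offset:+.2f}, SD {spread:.2f}")
```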
Components of Rater Training
1. Orientation
Raters study the rubric, band descriptors, and test specifications. The focus is on understanding what each criterion means and how the bands differ from one another. This is not a passive reading exercise — it requires active discussion and clarification.
2. Benchmarking
Raters score a set of benchmark samples — performances that have been pre-rated by expert panels and represent each band level. After scoring independently, raters compare their scores with the benchmark and discuss discrepancies. This is where calibration happens: raters adjust their internal standards to align with the agreed benchmarks.
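A minimal sketch of the comparison step, assuming hypothetical sample IDs, benchmark scores, and one rater's independent scores; any mismatch is surfaced so the discussion can return to the descriptors.

```python
# Benchmark scores agreed by an expert panel (hypothetical samples and values).
benchmark = {"sample_1": 5, "sample_2": 7, "sample_3": 4, "sample_4": 6}

# One rater's independent scores on the same samples.
rater_scores = {"sample_1": 5, "sample_2": 6, "sample_3": 5, "sample_4": 6}

# Flag every sample where the rater deviates from the benchmark,
# so each discrepancy can be discussed against the band descriptors.
for sample, agreed in benchmark.items():
    given = rater_scores[sample]
    if given != agreed:
        print(f"{sample}: rater gave {given}, benchmark is {agreed} (diff {given - agreed:+d})")
```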
3. Practice Rating
Raters score additional samples independently, receive feedback, and discuss borderline cases. Multiple rounds may be needed. The emphasis is on applying the descriptors as written, not on personal preferences about what "good" writing or speaking sounds like.
4. Qualification
In formal testing programmes (IELTS, Cambridge), raters must pass a standardisation test — scoring a set of performances within acceptable tolerance of the benchmark scores. Raters who fall outside tolerance receive additional training or are not certified.
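As a sketch of what a tolerance check might look like, the function below treats a rater as qualifying when a sufficient proportion of their scores fall within one band of the benchmark; the ±1-band tolerance and 80% threshold are illustrative assumptions, not the actual IELTS or Cambridge criteria.

```python
def qualifies(rater_scores, benchmark_scores, tolerance=1, required_rate=0.8):
    """Return True if enough of the rater's scores fall within
    `tolerance` bands of the benchmark. Thresholds are illustrative."""
    within = sum(
        1 for given, agreed in zip(rater_scores, benchmark_scores)
        if abs(given - agreed) <= tolerance
    )
    return within / len(benchmark_scores) >= required_rate

# Example: 9 of 10 scores within one band of the benchmark, so the rater qualifies.
print(qualifies([5, 6, 7, 4, 8, 6, 5, 7, 6, 3],
                [5, 7, 7, 4, 7, 6, 5, 6, 6, 5]))
```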
5. Ongoing Monitoring
Training is not a one-off event. Rater drift occurs over time as personal standards gradually shift. Monitoring mechanisms include:
- Regular re-standardisation sessions
- Statistical analysis of each rater's scoring patterns (mean, SD, inter-rater agreement; see the sketch after this list)
- Double-marking a proportion of scripts
- Feedback on flagged scores that deviate significantly from expected patterns
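A minimal monitoring sketch, assuming hypothetical scores from two raters on the same set of double-marked scripts: it reports each rater's mean and SD, an exact-plus-adjacent agreement rate, and the scripts whose scores diverge by more than one band and would be flagged for review.

```python
from statistics import mean, stdev

# Hypothetical band scores from two raters on the same double-marked scripts
# (the same index refers to the same script).
rater_1 = [5, 6, 7, 4, 8, 6, 5, 7, 6, 5]
rater_2 = [5, 7, 7, 5, 6, 6, 4, 6, 6, 6]

for name, scores in (("rater_1", rater_1), ("rater_2", rater_2)):
    print(f"{name}: mean {mean(scores):.2f}, SD {stdev(scores):.2f}")

# Exact or adjacent (within one band) agreement across the double-marked scripts.
pairs = list(zip(rater_1, rater_2))
agreement = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
print(f"exact/adjacent agreement: {agreement:.0%}")

# Scripts where the two scores diverge by more than one band are flagged
# for third marking or rater feedback.
flagged = [i for i, (a, b) in enumerate(pairs) if abs(a - b) > 1]
print("scripts flagged for review:", flagged)
```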
Standardisation Meetings
The standardisation meeting is the core event in rater training. A typical session:
- Review the rating criteria and any updates
- Score 3–5 benchmark samples independently
- Compare scores and discuss reasons for agreement/disagreement
- Score additional samples and check convergence
- Address specific problem areas (e.g., how to score a Band 5/6 borderline performance)
These meetings work best with an experienced lead who keeps the discussion anchored to the descriptors rather than letting it drift into personal preference.
In Institutional Contexts
Language schools and universities that develop their own writing or speaking assessments need rater training even with small teams. The minimum:
- A shared rubric with clear band descriptors
- A set of anchor scripts at each level (collected and agreed upon by the team)
- At least one calibration session before each assessment period
- A sample of double-marked scripts to check agreement
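For that last double-marking check, a minimal sketch (hypothetical markers and scores) that reports unweighted Cohen's kappa on exact band agreement; weighted variants that give partial credit to adjacent bands are also common, but this version keeps the arithmetic visible.

```python
from collections import Counter

def cohens_kappa(first, second):
    """Unweighted Cohen's kappa for two markers' band scores on the same scripts."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    freq_1 = Counter(first)
    freq_2 = Counter(second)
    # Agreement expected by chance, from each marker's marginal band frequencies.
    expected = sum(freq_1[band] * freq_2[band] for band in freq_1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical double-marked scripts from a small institutional team.
marker_a = [5, 6, 6, 4, 7, 5, 6, 5]
marker_b = [5, 6, 5, 4, 7, 6, 6, 5]
print(f"kappa = {cohens_kappa(marker_a, marker_b):.2f}")
```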
See Also
- Inter-rater Reliability — the outcome that rater training aims to achieve
- Rating Scale — the instrument raters are trained to apply
- Band Descriptors — the descriptions raters must internalise
- Moderation — the quality assurance process that checks rater consistency
- Interlocutor Frame — standardised prompts that complement rater training in speaking tests