ELTiverse

Standardization

Assessment, Examiner Standardization, Rater Training, Benchmarking

Standardization is the process of training and calibrating examiners (raters) so that they apply a rating scale consistently. Without it, the same performance can receive different scores from different raters, which undermines inter-rater reliability and, by extension, test validity.

The Process

A typical standardization session follows this sequence:

  1. Familiarization — raters study the rating scale descriptors and assessment criteria
  2. Benchmarking — raters score sample performances that have been pre-rated by senior examiners; discrepancies are discussed
  3. Practice rating — raters score additional samples independently, then compare and reconcile
  4. Certification — raters must achieve acceptable agreement levels (often measured by exact or adjacent agreement rates, or correlation coefficients) before they are approved to rate live assessments
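
The agreement measures mentioned in step 4 can be sketched in a few lines of Python. This is a minimal illustration, not any exam board's actual procedure: the function names, the sample scores, and the cut-offs MIN_EXACT and MIN_ADJACENT are all assumptions made up for the example, and "adjacent" is taken here to mean within one band of the benchmark score, exact matches included.

```python
from math import sqrt

def agreement_rates(rater, benchmark):
    """Return (exact, adjacent) agreement proportions for paired scores."""
    if len(rater) != len(benchmark) or not rater:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(rater)
    exact = sum(r == b for r, b in zip(rater, benchmark)) / n
    # Adjacent agreement: scores within one band of each other (exact included).
    adjacent = sum(abs(r - b) <= 1 for r, b in zip(rater, benchmark)) / n
    return exact, adjacent

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: benchmark scores set by senior examiners, and one trainee's scores.
benchmark = [5, 6, 7, 4, 6, 8, 5, 7]
trainee = [5, 6, 6, 4, 7, 8, 3, 7]

# Hypothetical certification thresholds; real programmes set their own.
MIN_EXACT = 0.60
MIN_ADJACENT = 0.90

exact, adjacent = agreement_rates(trainee, benchmark)
print(f"exact={exact:.3f} adjacent={adjacent:.3f} r={pearson(trainee, benchmark):.3f}")
print("certified" if exact >= MIN_EXACT and adjacent >= MIN_ADJACENT else "needs more practice")
```

With these sample scores the trainee matches the benchmark exactly on 5 of 8 performances (0.625) and is within one band on 7 of 8 (0.875), so under the assumed thresholds they would fall short on adjacent agreement.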

Key Concepts

  • Rater severity/leniency: individual raters tend to be systematically harsh or generous; standardization aims to narrow this range
  • Rater drift: even trained raters can become less consistent over time, requiring re-standardization (IELTS examiners are re-certified regularly)
  • Many-facet Rasch measurement (Linacre 1989): a statistical approach that models rater severity as a measurable facet alongside candidate ability and task difficulty, enabling post-hoc adjustment
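
For readers who want to see how rater severity enters the model, the rating-scale form of the many-facet Rasch model is commonly written as below. The symbols are the conventional ones from the measurement literature, not notation used elsewhere in this entry.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

Here P_{nijk} is the probability that candidate n receives score category k from rater j on task i, B_n is the candidate's ability, D_i is the task's difficulty, C_j is the rater's severity, and F_k is the difficulty of moving from category k-1 to category k. Because C_j is estimated separately from B_n, a candidate's measure can be adjusted for having been scored by a harsher or more lenient rater.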

IELTS as a Case Study

IELTS employs one of the most rigorous standardization systems in language testing. Examiners undergo initial certification training, are monitored through recorded assessments, and must pass re-certification. Double marking is used for Writing. This infrastructure is what allows scores from different test centers worldwide to be meaningfully compared: a given band is meant to carry the same meaning regardless of who assigned it.

Practical Implication

Any institution using subjective assessment (speaking, writing) needs some form of standardization. Even informal moderation meetings, in which teachers score the same student work and discuss differences, can noticeably improve scoring consistency.
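
One low-effort way to focus a moderation meeting is to discuss only the pieces of work on which teachers disagreed most. The sketch below assumes each piece has been scored independently by several teachers on the same scale; the function name, the sample data, and the one-band tolerance (max_spread) are illustrative assumptions rather than an established standard.

```python
def flag_for_discussion(scores_by_script, max_spread=1):
    """Return IDs of scripts whose highest and lowest scores differ by more than max_spread."""
    flagged = []
    for script_id, scores in scores_by_script.items():
        if max(scores) - min(scores) > max_spread:
            flagged.append(script_id)
    return flagged

# Hypothetical scores from three teachers for three essays.
scores = {
    "essay_A": [6, 6, 7],  # spread of 1: within tolerance
    "essay_B": [4, 6, 7],  # spread of 3: worth discussing
    "essay_C": [5, 5, 5],  # full agreement
}

print(flag_for_discussion(scores))  # ['essay_B']
```

Raising or lowering max_spread changes how much disagreement is tolerated before a script is brought to the table.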
