Intra-rater Reliability

Research Methodology, Assessment

Intra-rater reliability is the consistency of judgements made by a single rater across time, items, or comparable performances. Where inter-rater reliability asks whether different raters agree, intra-rater reliability asks whether the same rater agrees with herself. Both address sources of measurement error in performance assessment.

Estimation

The standard procedure has a rater score the same set of performances on two occasions separated by enough time (typically a week or more) to reduce memory of specific scripts. Identifying information is usually masked on the second pass and the order of scripts re-randomised. Agreement between the two scoring rounds is then summarised with one or more indices: the Pearson correlation, which captures consistency of rank order but is blind to a uniform shift in severity; the intra-class correlation coefficient (ICC), whose absolute-agreement forms do penalise such a shift; Cohen's kappa for categorical bands; or weighted kappa where the distance between ordinal categories matters.
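The indices named above can be computed directly from two lists of scores, one per scoring round. What follows is a minimal Python sketch, not a reference implementation. The function names (`pearson_r`, `cohen_kappa`, `icc_agreement`) are illustrative. Of the several ICC forms in use, it implements only the two-way, absolute-agreement, single-measure variant (Shrout and Fleiss's ICC(2,1)), and it treats the two occasions as the column facet. Validated statistical libraries are preferable for operational work.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: consistency of rank order, blind to a constant shift."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohen_kappa(x, y, categories, weights=None):
    """Cohen's kappa between two scoring rounds.

    weights=None gives unweighted kappa; "linear" or "quadratic" gives
    weighted kappa, where near-misses on an ordinal scale cost less.
    """
    k, n = len(categories), len(x)
    pos = {c: i for i, c in enumerate(categories)}
    # Observed joint proportions: rows = round 1, columns = round 2.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(x, y):
        obs[pos[a]][pos[b]] += 1 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def cost(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at the extremes.
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed = sum(cost(i, j) * obs[i][j] for i in range(k) for j in range(k))
    chance = sum(cost(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - observed / chance


def icc_agreement(x, y):
    """Two-way, absolute-agreement, single-measure ICC (ICC(2,1)),
    with the two occasions treated as the column facet."""
    data = list(zip(x, y))
    n, k = len(data), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [sum(r) / k for r in data]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in data for v in r)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Consider ten hypothetical scripts on a five-band scale where round 2 reproduces round 1 on five scripts and misses the other five by one band (`[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]` against `[1, 2, 3, 4, 5, 2, 3, 4, 5, 4]`). Unweighted kappa is 0.375, while quadratic-weighted kappa is 32/37 (about 0.86), because adjacent-band disagreements cost little under quadratic weights. Scores shifted uniformly by one band (`[1, 2, 3]` against `[2, 3, 4]`) give a Pearson r of 1 but an agreement ICC of 2/3, illustrating that correlation tracks rank order rather than agreement.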

In rating-scale assessment, intra-rater consistency can also be examined within a single session: a rater scoring fifty essays should not drift in severity from script one to script fifty, and a rater applying multiple criteria (task achievement, coherence, vocabulary) should preserve the relative weighting set by the rubric. The many-facet Rasch model assesses rater consistency through fit statistics such as infit and outfit mean squares: misfit on the rater facet flags within-rater inconsistency that simple agreement coefficients can mask.
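Within-session drift can be screened with something much simpler than a Rasch analysis, provided each script also has a reference score. That is an assumption: the text does not say where such scores come from, and a panel consensus is one possibility. The sketch below uses a hypothetical `drift_slope` helper that fits a least-squares line to the rater's deviation from the reference against script position. It is a heuristic screen for linear drift, not a substitute for many-facet Rasch fit statistics.

```python
def drift_slope(rater_scores, reference_scores):
    """Least-squares slope of (rater score - reference score) on script position.

    Negative: the rater grows harsher as the session goes on.
    Positive: the rater grows more lenient. Near zero: no linear drift.
    """
    residuals = [r - ref for r, ref in zip(rater_scores, reference_scores)]
    n = len(residuals)
    positions = range(1, n + 1)
    mean_pos = (n + 1) / 2
    mean_res = sum(residuals) / n
    sxy = sum((p - mean_pos) * (e - mean_res) for p, e in zip(positions, residuals))
    sxx = sum((p - mean_pos) ** 2 for p in positions)
    return sxy / sxx


# Hypothetical session of 50 scripts with panel reference scores.
reference = [3, 4, 2, 5, 3] * 10
# Hypothetical rater: on target for the first 25 scripts, one band harsher after.
rater = [ref - (i // 25) for i, ref in enumerate(reference)]
print(round(drift_slope(rater, reference), 3))  # prints -0.03
```

A straight-line fit is a deliberately crude choice. A sudden step change, as in this example, shows up only as a modest average slope, so comparing mean deviations for the first and second halves of the session is a natural complement.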

Practical relevance

Low intra-rater reliability undermines the entire scoring operation; no amount of moderation between raters compensates for raters who disagree with themselves. Causes include fatigue and drift over long marking sessions, ambiguous descriptors, and inadequate rater training. Periodic anchor scripts, pre-scored exemplars inserted into the marking stream, are a routine operational check.
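The anchor-script check can be sketched in two steps. First, seed the pre-scored exemplars into the marking stream at random positions. Then compare the rater's marks on them with the pre-assigned scores. The helper names (`build_stream`, `check_anchors`), the script IDs, and the tolerances (a mean absolute deviation of 0.5 bands, and no single anchor more than one band off) are hypothetical placeholders. Operational programmes set their own thresholds.

```python
import random


def build_stream(live_scripts, anchor_ids, seed=None):
    """Insert anchor script IDs at random positions among the live scripts."""
    rng = random.Random(seed)
    stream = list(live_scripts)
    for anchor_id in anchor_ids:
        # randint is inclusive at both ends, so an anchor may also land last.
        stream.insert(rng.randint(0, len(stream)), anchor_id)
    return stream


def check_anchors(rater_scores, anchor_key, max_mean_abs_dev=0.5, max_single_dev=1):
    """Compare a rater's marks on anchor scripts with their pre-assigned scores.

    rater_scores: dict script_id -> score the rater gave.
    anchor_key:   dict anchor_id -> pre-assigned score.
    Returns (mean absolute deviation, anchors outside tolerance, flagged?).
    """
    devs = {a: rater_scores[a] - key for a, key in anchor_key.items()}
    mad = sum(abs(d) for d in devs.values()) / len(devs)
    outliers = [a for a, d in devs.items() if abs(d) > max_single_dev]
    return mad, outliers, mad > max_mean_abs_dev or bool(outliers)


live = [f"S{i:02d}" for i in range(1, 21)]  # hypothetical live script IDs
key = {"A1": 3, "A2": 5, "A3": 2}          # hypothetical anchor key
stream = build_stream(live, key, seed=7)
marks = {"A1": 3, "A2": 3, "A3": 2}        # rater's marks on the anchors only
print(check_anchors(marks, key))           # flags A2, two bands below its key
```

Randomising anchor positions keeps the rater from recognising where the checks fall. A fixed seed makes a given stream reproducible for audit.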

Intra-rater consistency is necessary but not sufficient: a rater can be perfectly self-consistent yet apply the rubric idiosyncratically, producing systematic bias that only inter-rater analysis exposes.
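A toy example makes the point concrete. The scores below are invented for illustration. A single colleague stands in for whatever inter-rater comparison a programme actually uses, such as a second marker, a panel, or a Rasch severity estimate.

```python
# Hypothetical scores on ten scripts (bands 1-6).
colleague = [2, 3, 4, 3, 5, 2, 4, 3, 5, 4]
rater_round1 = [b + 1 for b in colleague]  # consistently one band lenient
rater_round2 = list(rater_round1)          # identical on re-marking


def exact_agreement(x, y):
    """Proportion of scripts given exactly the same band."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def mean_difference(x, y):
    """Average signed difference x - y; positive means x is more lenient."""
    return sum(a - b for a, b in zip(x, y)) / len(x)


print(exact_agreement(rater_round1, rater_round2))  # 1.0: perfectly self-consistent
print(exact_agreement(rater_round1, colleague))     # 0.0: never matches the colleague
print(mean_difference(rater_round1, colleague))     # 1.0: systematic leniency
```

The intra-rater check alone would rate this marker as flawless; only the comparison with another rater reveals the one-band bias.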

