Differential Item Functioning
Differential item functioning (DIF) occurs when an item behaves differently for examinees of equal ability who belong to different subgroups, typically defined by sex, first language, age, or instructional background. DIF analysis conditions on ability so that overall group differences in performance (often called impact) are not mistaken for bias. An item is flagged for DIF only when examinees of matched ability in different subgroups differ in their expected response to it.
Detection methods
The Mantel-Haenszel (MH) procedure stratifies examinees by total score, computes the odds ratio of a correct response for the reference group relative to the focal group within each stratum, and pools these ratios across strata into a common odds ratio. Educational Testing Service's MH delta classification, which sorts items into categories A (negligible DIF), B (moderate), and C (large), has long served as a routine screen on operational tests.
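The stratify-and-pool logic can be illustrated with a minimal sketch. It assumes a single dichotomous item, uses the standard Mantel-Haenszel common odds ratio, and converts it to the ETS delta metric as -2.35 times its natural log. The function name, the "ref"/"focal" labels, and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(records):
    """Return (common odds ratio, MH D-DIF) for one dichotomous item.

    records: iterable of (total_score, group, correct), where group is
    "ref" or "focal" and correct is 0 or 1. These labels are illustrative.
    """
    # One 2x2 table per total-score stratum:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for score, group, correct in records:
        if group not in ("ref", "focal"):
            raise ValueError(f"unknown group: {group!r}")
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[score][index] += 1

    # Pool across strata: sum(A*D/N) / sum(B*C/N)
    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio undefined for these data")

    alpha = numerator / denominator
    delta = -2.35 * math.log(alpha)  # ETS delta metric (MH D-DIF)
    return alpha, delta


# Toy data, invented for illustration: one stratum at total score 10.
records = (
    [(10, "ref", 1)] * 10 + [(10, "ref", 0)] * 10
    + [(10, "focal", 1)] * 5 + [(10, "focal", 0)] * 15
)
alpha, delta = mantel_haenszel_dif(records)
print(f"alpha_MH = {alpha:.2f}, MH D-DIF = {delta:.2f}")
```

On the toy data the common odds ratio is 3.00 and the delta value is about -2.58. A negative delta means the item is harder for the focal group than for matched members of the reference group. The operational A/B/C categories combine the size of delta with significance tests, which this sketch leaves out.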
Logistic regression methods regress the item response on ability, group membership, and their interaction. A significant group main effect signals uniform DIF, in which one group is favoured across the whole ability range; a significant interaction signals non-uniform DIF, where the focal group is favoured at one ability range and the reference group at another. IRT-based methods take a parallel approach, comparing item parameters across groups, with Lord's chi-square test, the likelihood-ratio test, and Raju's area measures as common indices. Rasch DIF analyses, typically run in FACETS or Winsteps, report item difficulty by subgroup together with contrast statistics.
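The logistic regression approach can be written out as a single model. The notation here is an illustrative assumption rather than that of any particular source: \theta_i is the matching ability measure for examinee i (in practice often the total score), G_i is a 0/1 indicator of focal-group membership, and P_i is the probability of a correct response.

```latex
\log\frac{P_i}{1 - P_i}
  = \beta_0 + \beta_1 \theta_i + \beta_2 G_i + \beta_3 \,(\theta_i \times G_i)
```

Under this parameterisation the group difference in log-odds at ability \theta is \beta_2 + \beta_3 \theta. With \beta_3 = 0 and \beta_2 \neq 0 the difference is constant across ability, which is uniform DIF. With \beta_3 \neq 0 it changes with ability and can switch sign, which is non-uniform DIF.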
Use in language testing
DIF analysis is now routine in major language tests. Studies have examined whether IELTS or TOEFL items disadvantage particular L1 backgrounds, gender groups, or age bands. A flagged item is reviewed by content experts: DIF identifies a statistical asymmetry, but only substantive review can decide whether the asymmetry reflects construct-irrelevant variance, which is a validity threat, or genuine, intended differences in the construct itself. Sustained DIF that disadvantages an identifiable subgroup feeds directly into consequential validity arguments and into fairness review under the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014).
References
- Holland, P. W., & Wainer, H. (Eds.). (1993). Differential Item Functioning. Lawrence Erlbaum Associates.
- AERA, APA, & NCME. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.
- Ferne, T., & Rupp, A. A. (2007). A synthesis of 15 years of research on DIF in language testing. Language Assessment Quarterly, 4(2), 113–148.