Item Discrimination
Item discrimination measures how effectively a test item differentiates between high-ability and low-ability test takers. A good item is one that strong candidates tend to get right and weak candidates tend to get wrong. An item that everyone gets right, everyone gets wrong, or that strong and weak candidates answer equally well provides no useful information about individual differences.
Calculation: The Discrimination Index (D)
The most common method in classical test theory:
- Rank all test takers by total score
- Take the top 27% (upper group) and bottom 27% (lower group)
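- Calculate D as the difference between the two groups' proportions correct:

$$D = p_{\text{upper}} - p_{\text{lower}}$$

where p is the proportion answering correctly in each group.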
| D Value | Interpretation | Action |
|---|---|---|
| 0.40+ | Excellent | Keep |
| 0.30–0.39 | Good | Keep |
| 0.20–0.29 | Acceptable | Review for improvement |
| 0.10–0.19 | Poor | Revise substantially |
| 0.00–0.09 | No discrimination | Discard or rewrite |
| Negative | Reversed (weak candidates outperform strong) | Discard immediately |
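The steps and the interpretation table above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `responses` layout (one list of 0/1 item scores per candidate), and the rounding rule for group size are all assumptions made for the example. D is taken as the upper group's proportion correct minus the lower group's.

```python
def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower discrimination index D for one item.

    responses: list of per-candidate lists of 0/1 item scores (assumed layout).
    item: column index of the item to analyse.
    fraction: share of candidates placed in each extreme group.
    """
    # Rank candidates by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    # Group size: the chosen fraction of candidates, rounded, at least one each.
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(row[item] for row in upper) / n
    p_lower = sum(row[item] for row in lower) / n
    return p_upper - p_lower


def interpret(d):
    """Map a D value to the interpretation label from the table."""
    if d < 0:
        return "Negative"
    if d < 0.10:
        return "No discrimination"
    if d < 0.20:
        return "Poor"
    if d < 0.30:
        return "Acceptable"
    if d < 0.40:
        return "Good"
    return "Excellent"


# Illustrative data only: 10 candidates, 4 items.
scores = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 0, 0], [1, 0, 1, 0],
    [0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0],
]
for item in range(4):
    d = discrimination_index(scores, item)
    print(f"item {item}: D = {d:.2f} ({interpret(d)})")
```

Candidates tied on total score at a group boundary are split by their original order here; a real analysis would need an explicit tie-breaking rule.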
Why 27%?
Kelley (1939) demonstrated that using the top and bottom 27% optimises the balance between making the groups as different as possible (which favours extreme groups) and having enough data in each group for stable estimates. In practice, using 25% or 33% produces similar results with typical class sizes.
Point-Biserial Correlation
The point-biserial correlation ($r_{pb}$) is a more statistically robust alternative. It measures the relationship between performance on a single item (scored 0/1) and the total test score. Values above 0.25 are generally acceptable; values above 0.40 are strong. Unlike D, the point-biserial uses all test takers' data, not just the extreme groups.
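As a rough sketch, the point-biserial can be computed from the mean total scores of candidates who answered the item correctly and incorrectly, using the standard formula $r_{pb} = \frac{M_1 - M_0}{s}\sqrt{p(1-p)}$ with the population standard deviation $s$ of total scores. The function name and argument layout are assumptions for illustration; only Python's standard library is used.

```python
import math
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and total test scores."""
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    # Total scores of candidates who got the item right vs. wrong.
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    sd = pstdev(total_scores)  # population SD of total scores
    # Undefined when everyone answers the item the same way or totals do not vary.
    if not right or not wrong or sd == 0:
        return float("nan")
    p = len(right) / len(item_scores)  # item difficulty (proportion correct)
    return (mean(right) - mean(wrong)) / sd * math.sqrt(p * (1 - p))


# Illustrative data only: four candidates.
print(point_biserial([1, 1, 0, 0], [4, 3, 2, 1]))
```

Whether `total_scores` includes the item being analysed is left to the caller in this sketch; including it tends to inflate the correlation slightly, especially on short tests.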
Negative Discrimination
A negative D value is a serious red flag. Common causes:
- Miskeying: the marked answer is wrong
- Ambiguity: the item can be legitimately interpreted in multiple ways, and stronger candidates see the alternative reading
- Construct-irrelevant difficulty: the item tests something other than the intended construct (e.g., obscure vocabulary in a reading comprehension item)
- Cultural bias: background knowledge advantages the lower-ability group on this particular item
Any item with negative discrimination should be removed from scoring and investigated before reuse.
Relationship to Item Difficulty
Item difficulty and discrimination are linked. Items at extreme difficulty levels (p < 0.10 or p > 0.90) have restricted room to discriminate: if nearly everyone gets an item right, there is little variance to work with. The highest discrimination potential occurs at moderate difficulty (p ≈ 0.50 for norm-referenced tests).
However, the relationship is not deterministic. A moderately difficult item can still have poor discrimination if it is ambiguous or measures something unrelated to overall ability.
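The "little variance to work with" point can be illustrated with the variance of a dichotomous item, p(1 − p), which peaks at p = 0.50 and shrinks towards zero at the extremes. The snippet below is only a numerical illustration; the helper name `item_variance` is made up for the example.

```python
def item_variance(p):
    """Variance of a 0/1 item whose proportion correct is p."""
    return p * (1 - p)


# Variance available for discrimination at several difficulty levels.
for p in (0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95):
    print(f"p = {p:.2f}  variance = {item_variance(p):.4f}")
```

A high variance only sets the ceiling; as noted above, an ambiguous or off-construct item at p ≈ 0.50 can still discriminate poorly.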
Key References
- Ebel, R. L. & Frisbie, D. A. (1991). Essentials of Educational Measurement (5th ed.). Prentice Hall.
- Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.
- Brown, J. D. (2005). Testing in Language Programs: A Comprehensive Guide to English Language Assessment. McGraw-Hill.
See Also
- Item Analysis: the broader process encompassing difficulty, discrimination, and distractor analysis
- Item Difficulty: the companion statistic measuring how easy or hard an item is
- Classical Test Theory: the measurement framework underpinning the discrimination index
- Reliability: removing poorly discriminating items improves test reliability