Item Discrimination

Assessment

Item discrimination measures how effectively a test item differentiates between high-ability and low-ability test takers. A good item is one that strong candidates tend to get right and weak candidates tend to get wrong. An item that everyone gets right, everyone gets wrong, or that strong and weak candidates answer equally well provides no useful information about individual differences.

Calculation: The Discrimination Index (D)

The most common method in classical test theory is the upper-lower group index:

1. Rank all test takers by total score.
2. Take the top 27% (upper group) and the bottom 27% (lower group).
3. Calculate:

$D = p_{\text{upper}} - p_{\text{lower}}$

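where $p_{\text{upper}}$ and $p_{\text{lower}}$ are the proportions of test takers in the upper and lower groups who answered the item correctly. D therefore ranges from -1 to +1.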
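The calculation can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the function name `discrimination_index`, the rounding of the group size, and the handling of tied total scores (ties keep their input order) are all assumptions made for the example.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Return D = p_upper - p_lower for one item.

    total_scores: each test taker's total test score.
    item_correct: 1 if that test taker got the item right, else 0.
    fraction: share of test takers in each extreme group (assumed 0.27).
    """
    if len(total_scores) != len(item_correct):
        raise ValueError("total_scores and item_correct must have equal length")
    n = len(total_scores)
    if n < 2:
        raise ValueError("need at least two test takers")

    # Group size: rounded share of the cohort, at least one person.
    k = max(1, round(n * fraction))

    # Rank test takers from highest to lowest total score.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = order[:k], order[-k:]

    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


# Ten hypothetical test takers, already listed from strongest to weakest.
totals = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
d = discrimination_index(totals, item)
print(f"D = {d:.2f}")
```

With ten test takers each group holds three people. All three in the upper group answered correctly and one of three in the lower group did, so D is 1.00 - 0.33 = 0.67.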

D Value	Interpretation	Action
0.40+	Excellent	Keep
0.30–0.39	Good	Keep
0.20–0.29	Acceptable	Review for improvement
0.10–0.19	Poor	Revise substantially
0.00–0.09	No discrimination	Discard or rewrite
Negative	Reversed — weak outperform strong	Discard immediately
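The table translates directly into a lookup. The sketch below is illustrative only: the function name `interpret_d` is hypothetical, and treating the lower bound of each band as a cut-off (so that a value such as 0.295 counts as "Acceptable") is an assumption, since the table does not say how values between bands are handled.

```python
def interpret_d(d):
    """Map a discrimination index to the (interpretation, action) pair
    from the table above. Band edges are treated as lower cut-offs."""
    if d < 0:
        return ("Reversed", "Discard immediately")
    if d >= 0.40:
        return ("Excellent", "Keep")
    if d >= 0.30:
        return ("Good", "Keep")
    if d >= 0.20:
        return ("Acceptable", "Review for improvement")
    if d >= 0.10:
        return ("Poor", "Revise substantially")
    return ("No discrimination", "Discard or rewrite")


for value in (0.45, 0.22, 0.05, -0.15):
    label, action = interpret_d(value)
    print(f"D = {value:+.2f}: {label} -> {action}")
```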

Why 27%?

Kelley (1939) demonstrated that, for normally distributed scores, the top and bottom 27% give the best trade-off between making the groups as different as possible (which favours more extreme groups) and keeping enough test takers in each group for stable estimates (which favours larger groups). In practice, using 25% or 33% produces similar results with typical class sizes.

Point-Biserial Correlation

The point-biserial correlation ($r_{pb}$) is a more statistically robust alternative. It measures the correlation between performance on a single item (scored 0/1) and the total test score. Values above 0.25 are generally considered acceptable, and values above 0.40 strong. Unlike D, the point-biserial uses every test taker's data, not just the extreme groups.
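One way to compute the point-biserial is through the difference between the mean total scores of those who got the item right and those who got it wrong. The sketch below assumes that formulation, uses the population standard deviation of the totals, and includes the item itself in the total score, as in the description above. The function name `point_biserial` is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between a 0/1 item and total scores.

    r_pb = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are the mean
    totals of those answering right and wrong, s is the population standard
    deviation of all totals, and p is the proportion answering right.
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("inputs must have equal length")

    right = [t for x, t in zip(item_correct, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_correct, total_scores) if x == 0]
    if not right or not wrong:
        raise ValueError("item has no variance: everyone right or everyone wrong")

    spread = pstdev(total_scores)
    if spread == 0:
        raise ValueError("total scores have no variance")

    p = len(right) / len(item_correct)
    return (mean(right) - mean(wrong)) / spread * sqrt(p * (1 - p))


# Five hypothetical test takers: the three highest scorers got the item right.
r = point_biserial([1, 1, 1, 0, 0], [5, 4, 3, 2, 1])
print(f"r_pb = {r:.3f}")
```

For these five test takers the result is about 0.866, the same value an ordinary Pearson correlation between the two columns would give.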

Negative Discrimination

A negative D value is a serious red flag. Common causes:

Miskeying — the marked answer is wrong
Ambiguity — the item can be legitimately interpreted in multiple ways, and stronger candidates see the alternative reading
Construct-irrelevant difficulty — the item tests something other than the intended construct (e.g., obscure vocabulary in a reading comprehension item)
Cultural bias — background knowledge advantages the lower-ability group on this particular item

Any item with negative discrimination should be removed from scoring and investigated before reuse.
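Removing flagged items from scoring can be sketched as follows. The example assumes a response matrix with one row per test taker and one 0/1 column per item, plus a list of D values computed beforehand. The function name `rescore_without_flagged` and the data are hypothetical.

```python
def rescore_without_flagged(responses, d_values):
    """Drop items with negative D and recompute each total score.

    responses: list of rows, one per test taker, each a list of 0/1 scores.
    d_values: discrimination index for each item, in column order.
    Returns (flagged item indices, new total scores).
    """
    flagged = [j for j, d in enumerate(d_values) if d < 0]
    kept = [j for j, d in enumerate(d_values) if d >= 0]
    totals = [sum(row[j] for j in kept) for row in responses]
    return flagged, totals


responses = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
]
d_values = [0.5, -0.2, 0.3]

flagged, totals = rescore_without_flagged(responses, d_values)
print("Items to investigate:", flagged)
print("Totals without flagged items:", totals)
```

Here the second item (index 1) is flagged, and the totals are recomputed from the remaining two items. The flagged list is what goes forward for investigation of miskeying, ambiguity, or bias.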

Relationship to Item Difficulty

Item difficulty and discrimination are linked. Items at extreme difficulty levels (p < 0.10 or p > 0.90, where p is the proportion of all test takers answering correctly) have little room to discriminate: if nearly everyone gets an item right, there is little variance to work with. Discrimination potential is highest at moderate difficulty (p ≈ 0.50 for norm-referenced tests), where the item variance p(1 - p) peaks.

However, the relationship is not deterministic. A moderately difficult item can still have poor discrimination if it is ambiguous or measures something unrelated to overall ability.
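The ceiling that difficulty places on D can be made concrete. The sketch below is a simplified bound derived for this illustration, not a formula from the references. It asks how large D could possibly be if every correct answer were concentrated among the strongest test takers, ignores rounding of group sizes, and assumes groups of 27%. The function name `max_discrimination` is hypothetical.

```python
def max_discrimination(p, fraction=0.27):
    """Upper bound on D for an item with overall proportion correct p.

    Best case: correct answers fill the upper group first, and incorrect
    answers fill the lower group first.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if not 0.0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")

    # The upper group can be all correct only if enough people were correct.
    best_p_upper = min(1.0, p / fraction)
    # The lower group can be all wrong only if enough people were wrong.
    best_p_lower = max(0.0, 1.0 - (1.0 - p) / fraction)
    return best_p_upper - best_p_lower


for p in (0.05, 0.50, 0.90):
    print(f"p = {p:.2f}: D can be at most {max_discrimination(p):.2f}")
```

Under these assumptions an item that 90% of test takers answer correctly cannot exceed D ≈ 0.37, and one that only 5% answer correctly cannot exceed D ≈ 0.19, however well it is written. At p = 0.50 the ceiling is 1.00. This is only an upper limit: an ambiguous or off-construct item at moderate difficulty can still fall far below it.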

Key References

Ebel, R. L. & Frisbie, D. A. (1991). Essentials of Educational Measurement (5th ed.). Prentice Hall.
Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.
Brown, J. D. (2005). Testing in Language Programs: A Comprehensive Guide to English Language Assessment. McGraw-Hill.