Item Analysis
Item analysis is the statistical examination of test items after administration to evaluate their quality. It answers three questions about each item: How difficult was it? Did it distinguish between stronger and weaker students? Were the wrong answer options (distractors) effective? The results guide decisions about which items to keep, revise, or discard — and provide evidence for the overall quality and reliability of the test.
Item analysis is essential for any institution that develops its own tests, including achievement tests and progress tests. Without it, test writers must rely on intuition alone, and research consistently shows that intuitive judgments of item difficulty and quality are unreliable.
Core Statistics
Facility Value (Item Difficulty)
The facility value (FV), also called the difficulty index (p), measures how easy or difficult an item is. It is simply the proportion of test-takers who answered correctly.
Formula:
FV = Number of correct responses / Total number of test-takers
| FV Range | Interpretation | Action |
|---|---|---|
| 0.90–1.00 | Very easy — almost everyone got it right | Consider removing (no discrimination) |
| 0.70–0.89 | Easy | Acceptable for early items or confidence building |
| 0.30–0.69 | Moderate | Ideal range for most test items |
| 0.10–0.29 | Difficult | Acceptable if the item discriminates well |
| 0.00–0.09 | Very difficult — almost no one got it right | Investigate — possible flawed item |
A well-constructed test should contain items across the difficulty spectrum, with the majority in the 0.30–0.70 range (Hughes, 2003). The optimal mean difficulty for a norm-referenced test is around 0.50, which maximises the spread of scores.
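As a worked illustration, here is a minimal Python sketch of the FV calculation over a small 0/1 score matrix. The names (`facility_value`, `scores`) are illustrative, not from any particular testing package.

```python
def facility_value(item_scores):
    """Proportion of test-takers who answered the item correctly (FV)."""
    return sum(item_scores) / len(item_scores)

# scores[s][i] == 1 if student s answered item i correctly, else 0
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
]

for i in range(len(scores[0])):
    fv = facility_value([row[i] for row in scores])
    print(f"Item {i + 1}: FV = {fv:.2f}")
```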
Discrimination Index
The discrimination index (D) measures how well an item separates high-ability from low-ability test-takers. If strong students get it right and weak students get it wrong, the item discriminates well.
Calculation method (upper-lower method):
- Rank all test-takers by total score
- Take the top 27% (upper group) and bottom 27% (lower group)
- Calculate:
D = (proportion correct in upper group) − (proportion correct in lower group)
| D Value | Interpretation | Action |
|---|---|---|
| 0.40+ | Excellent discrimination | Keep |
| 0.30–0.39 | Good discrimination | Keep |
| 0.20–0.29 | Acceptable | Review for possible improvement |
| 0.10–0.19 | Poor discrimination | Revise |
| 0.00–0.09 | No discrimination | Discard or completely rewrite |
| Negative | Reversed discrimination (weak students outperform strong) | Discard — likely a flawed or ambiguous item |
A negative discrimination index is a red flag: it typically means the item is ambiguous (strong students overthink it), the key is wrong, or the item tests something unrelated to the construct.
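The upper-lower calculation is easy to script. A minimal sketch, assuming the same students-by-items 0/1 `scores` matrix as the facility-value example and the conventional 27% extreme groups:

```python
def discrimination_index(scores, item, fraction=0.27):
    """Upper-lower discrimination index (D) for one item.

    scores: one row of 0/1 item scores per student.
    item:   zero-based index of the item to analyse.
    """
    # Rank students by total test score, highest first
    ranked = sorted(scores, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of each extreme group
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower
```

With a small class the two 27% groups are tiny, which is one reason the sample-size caveat under Limitations below matters.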
Point-Biserial Correlation
The point-biserial correlation (rpb) is a more sophisticated alternative to the upper-lower discrimination index. It measures the correlation between performance on a single item (correct/incorrect) and the total test score, and it uses data from all test-takers rather than only the two extreme groups. Values above 0.25 are generally considered acceptable; values above 0.40 are strong.
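Since the point-biserial is simply a Pearson correlation in which one variable is dichotomous, it can be computed with NumPy alone. A sketch under the same score-matrix assumptions as above; the commented line shows the "corrected" item-total variant, which excludes the item from the total so it is not correlated with itself:

```python
import numpy as np

def point_biserial(scores, item):
    """Correlation between one 0/1 item and the total test score (rpb)."""
    scores = np.asarray(scores, dtype=float)
    item_col = scores[:, item]
    total = scores.sum(axis=1)
    # Corrected item-total variant (common in practice):
    # total = total - item_col
    return np.corrcoef(item_col, total)[0, 1]
```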
Distractor Analysis
For multiple-choice items, distractor analysis examines how each wrong option performed:
| Indicator | What it means |
|---|---|
| Distractor chosen by < 5% of test-takers | Non-functional — too implausible to attract anyone |
| Distractor chosen more by upper group than lower group | Malfunctioning — misleading strong students |
| Distractors roughly equally attractive to lower group | Well-designed — all options are plausible |
An ideal MCQ item has one clearly correct answer that the upper group selects, and 2–3 distractors that attract the lower group roughly equally. Non-functional distractors should be replaced with more plausible alternatives.
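A distractor tally needs the raw option choices rather than 0/1 scores. A minimal sketch, assuming one chosen option per student for the item plus each student's total score; `distractor_table` and its arguments are illustrative names:

```python
from collections import Counter

def distractor_table(answers, totals, key, fraction=0.27):
    """Tally option choices in the upper and lower groups for one MCQ item.

    answers: option chosen by each student on this item (e.g. 'A'..'D').
    totals:  each student's total test score, in the same order.
    key:     the correct option.
    """
    order = sorted(range(len(answers)), key=lambda s: totals[s], reverse=True)
    k = max(1, round(len(order) * fraction))
    upper = Counter(answers[s] for s in order[:k])
    lower = Counter(answers[s] for s in order[-k:])
    for option in sorted(set(answers) | {key}):
        mark = " (key)" if option == key else ""
        print(f"{option}{mark}: upper = {upper[option]}, lower = {lower[option]}")
```

A distractor that draws more upper-group than lower-group students, or almost no one at all, shows up immediately in this table.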
Practical Procedure
For a classroom teacher or test development team:
- Administer the test to a sufficiently large group (ideally 30+)
- Score all papers and rank by total score
- Calculate FV and D for each item (spreadsheet or testing software)
- Examine distractors for MCQ items
- Flag items with extreme FV (< 0.20 or > 0.90), poor D (< 0.20), or negative D (see the flagging sketch after this list)
- Decide: keep, revise, or discard each flagged item
- Build an item bank of validated items for future tests
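The flagging step lends itself to automation. A sketch that reuses `facility_value` and `discrimination_index` from the earlier examples, with thresholds mirroring the tables above:

```python
def flag_items(scores, fv_low=0.20, fv_high=0.90, d_min=0.20):
    """Return (item number, FV, D, reasons) for every item needing review."""
    flagged = []
    for i in range(len(scores[0])):
        fv = facility_value([row[i] for row in scores])
        d = discrimination_index(scores, i)
        reasons = []
        if fv < fv_low or fv > fv_high:
            reasons.append(f"extreme FV ({fv:.2f})")
        if d < 0:
            reasons.append(f"negative D ({d:.2f})")
        elif d < d_min:
            reasons.append(f"poor D ({d:.2f})")
        if reasons:
            flagged.append((i + 1, fv, d, reasons))
    return flagged
```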
Item Analysis and Test Quality
Item analysis data contributes to broader test quality:
- Reliability — Removing poorly discriminating items typically raises the test's internal consistency (Cronbach's alpha; see the sketch after this list)
- Construct Validity — Items with negative discrimination may be testing something other than the intended construct
- Fairness — Items with unexpected difficulty patterns may indicate cultural or linguistic bias
- Test length — Analysis may reveal redundant items that can be removed without reducing reliability
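For the reliability point, Cronbach's alpha is simple enough to compute directly, so the effect of dropping a flagged item can be checked empirically. A minimal sketch, assuming the same students-by-items 0/1 matrix as above:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Compare alpha with and without a suspect item (zero-based index `bad`):
# cronbach_alpha(np.delete(np.asarray(scores), bad, axis=1))
```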
Limitations
- Requires a sufficient sample size — item statistics from a class of 10 are unreliable
- Facility value is group-dependent — the same item has different FV values with different populations
- Classical item analysis cannot separate item difficulty from student ability (for that, see Rasch/IRT models)
- Only applies to objectively scored items — open-ended tasks require different quality checks (e.g., rubric review, rater agreement studies)
Key References
- Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.
- Brown, H. D. & Abeywickrama, P. (2010). Language Assessment: Principles and Classroom Practices (2nd ed.). Pearson.
- Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
- Heaton, J. B. (1988). Writing English Language Tests. Longman.
- Alderson, J. C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge University Press.
See Also
- Achievement Test — item analysis is essential for improving institutional tests
- Reliability — removing weak items improves test reliability
- Construct Validity — item analysis provides evidence for (or against) construct validity
- Validity — item quality underpins overall test validity