Item Analysis
Item analysis is the statistical examination of test items after administration to evaluate their quality. It answers three questions about each item: How difficult was it? Did it distinguish between stronger and weaker students? Were the wrong answer options (distractors) effective? The results guide decisions about which items to keep, revise, or discard — and provide evidence for the overall quality and reliability of the test.
Item analysis is essential for any institution that develops its own tests, including achievement tests and progress tests. Without it, test writers must rely on intuition alone, and research consistently shows that intuitive judgments of item difficulty and quality are unreliable.
Core Statistics
Facility Value (Item Difficulty)
The facility value (FV), also called the difficulty index (p), measures how easy or difficult an item is. It is simply the proportion of test-takers who answered correctly.
Formula:
FV = Number of correct responses / Total number of test-takers
| FV Range | Interpretation | Action |
|---|---|---|
| 0.90–1.00 | Very easy — almost everyone got it right | Consider removing (no discrimination) |
| 0.70–0.89 | Easy | Acceptable for early items or confidence building |
| 0.30–0.69 | Moderate | Ideal range for most test items |
| 0.10–0.29 | Difficult | Acceptable if the item discriminates well |
| 0.00–0.09 | Very difficult — almost no one got it right | Investigate — possible flawed item |
A well-constructed test should contain items across the difficulty spectrum, with the majority in the 0.30–0.70 range (Hughes, 2003). The optimal mean difficulty for a norm-referenced test is around 0.50, which maximises the spread of scores.
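As a worked illustration, here is a minimal Python sketch of the FV calculation over a small 0/1 score matrix. The names (`facility_value`, `scores`) are illustrative, not from any particular testing package.

```python
def facility_value(item_scores):
    """Proportion of test-takers who answered the item correctly (FV)."""
    return sum(item_scores) / len(item_scores)

# scores[s][i] == 1 if student s answered item i correctly, else 0
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
]

for i in range(len(scores[0])):
    fv = facility_value([row[i] for row in scores])
    print(f"Item {i + 1}: FV = {fv:.2f}")
```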
Discrimination Index
The discrimination index (D) measures how well an item separates high-ability from low-ability test-takers. If strong students get it right and weak students get it wrong, the item discriminates well.
Calculation method (upper-lower method):
- Rank all test-takers by total score
- Take the top 27% (upper group) and bottom 27% (lower group)
- Calculate:
D = (proportion correct in upper group) − (proportion correct in lower group)
| D Value | Interpretation | Action |
|---|---|---|
| 0.40+ | Excellent discrimination | Keep |
| 0.30–0.39 | Good discrimination | Keep |
| 0.20–0.29 | Acceptable | Review for possible improvement |
| 0.10–0.19 | Poor discrimination | Revise |
| 0.00–0.09 | No discrimination | Discard or completely rewrite |
| Negative | Reversed discrimination (weak students outperform strong) | Discard — likely a flawed or ambiguous item |
A negative discrimination index is a red flag: it typically means the item is ambiguous (strong students overthink it), the key is wrong, or the item tests something unrelated to the construct.
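The upper-lower calculation is easy to script. A minimal sketch, assuming the same students-by-items 0/1 `scores` matrix as the facility-value example and the conventional 27% extreme groups:

```python
def discrimination_index(scores, item, fraction=0.27):
    """Upper-lower discrimination index (D) for one item.

    scores: one row of 0/1 item scores per student.
    item:   zero-based index of the item to analyse.
    """
    # Rank students by total test score, highest first
    ranked = sorted(scores, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of each extreme group
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower
```

With a small class the two 27% groups are tiny, which is one reason the sample-size caveat under Limitations below matters.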
Point-Biserial Correlation
The point-biserial correlation (rpb) is a more sophisticated alternative to the upper-lower discrimination index. It measures the correlation between performance on a single item (correct/incorrect) and the total test score, and it uses data from all test-takers rather than only the two extreme groups. Values above 0.25 are generally considered acceptable; values above 0.40 are strong.
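Since the point-biserial is simply a Pearson correlation in which one variable is dichotomous, it can be computed with NumPy alone. A sketch under the same score-matrix assumptions as above; the commented line shows the "corrected" item-total variant, which excludes the item from the total so it is not correlated with itself:

```python
import numpy as np

def point_biserial(scores, item):
    """Correlation between one 0/1 item and the total test score (rpb)."""
    scores = np.asarray(scores, dtype=float)
    item_col = scores[:, item]
    total = scores.sum(axis=1)
    # Corrected item-total variant (common in practice):
    # total = total - item_col
    return np.corrcoef(item_col, total)[0, 1]
```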
Distractor Analysis
For multiple-choice items, distractor analysis examines how each wrong option performed:
| Indicator | What it means |
|---|---|
| Distractor chosen by < 5% of test-takers | Non-functional — too implausible to attract anyone |
| Distractor chosen more by upper group than lower group | Malfunctioning — misleading strong students |
| Distractors roughly equally attractive to lower group | Well-designed — all options are plausible |
An ideal MCQ item has one clearly correct answer that the upper group selects, and 2–3 distractors that attract the lower group roughly equally. Non-functional distractors should be replaced with more plausible alternatives.
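A distractor tally needs the raw option choices rather than 0/1 scores. A minimal sketch, assuming one chosen option per student for the item plus each student's total score; `distractor_table` and its arguments are illustrative names:

```python
from collections import Counter

def distractor_table(answers, totals, key, fraction=0.27):
    """Tally option choices in the upper and lower groups for one MCQ item.

    answers: option chosen by each student on this item (e.g. 'A'..'D').
    totals:  each student's total test score, in the same order.
    key:     the correct option.
    """
    order = sorted(range(len(answers)), key=lambda s: totals[s], reverse=True)
    k = max(1, round(len(order) * fraction))
    upper = Counter(answers[s] for s in order[:k])
    lower = Counter(answers[s] for s in order[-k:])
    for option in sorted(set(answers) | {key}):
        mark = " (key)" if option == key else ""
        print(f"{option}{mark}: upper = {upper[option]}, lower = {lower[option]}")
```

A distractor that draws more upper-group than lower-group students, or almost no one at all, shows up immediately in this table.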
Practical Procedure
For a classroom teacher or test development team:
- Administer the test to a sufficiently large group (ideally 30+)
- Score all papers and rank by total score
- Calculate FV and D for each item (spreadsheet or testing software)
- Examine distractors for MCQ items
- Flag items with extreme FV (< 0.20 or > 0.90), poor D (< 0.20), or negative D (see the flagging sketch after this list)
- Decide: keep, revise, or discard each flagged item
- Build an item bank of validated items for future tests
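The flagging step lends itself to automation. A sketch that reuses `facility_value` and `discrimination_index` from the earlier examples, with thresholds mirroring the tables above:

```python
def flag_items(scores, fv_low=0.20, fv_high=0.90, d_min=0.20):
    """Return (item number, FV, D, reasons) for every item needing review."""
    flagged = []
    for i in range(len(scores[0])):
        fv = facility_value([row[i] for row in scores])
        d = discrimination_index(scores, i)
        reasons = []
        if fv < fv_low or fv > fv_high:
            reasons.append(f"extreme FV ({fv:.2f})")
        if d < 0:
            reasons.append(f"negative D ({d:.2f})")
        elif d < d_min:
            reasons.append(f"poor D ({d:.2f})")
        if reasons:
            flagged.append((i + 1, fv, d, reasons))
    return flagged
```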
Item Analysis and Test Quality
Item analysis data contributes to broader test quality:
- Reliability — Removing poorly discriminating items typically raises the test's internal consistency (Cronbach's alpha; see the sketch after this list)
- Construct Validity — Items with negative discrimination may be testing something other than the intended construct
- Fairness — Items with unexpected difficulty patterns may indicate cultural or linguistic bias
- Test length — Analysis may reveal redundant items that can be removed without reducing reliability
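For the reliability point, Cronbach's alpha is simple enough to compute directly, so the effect of dropping a flagged item can be checked empirically. A minimal sketch, assuming the same students-by-items 0/1 matrix as above:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Compare alpha with and without a suspect item (zero-based index `bad`):
# cronbach_alpha(np.delete(np.asarray(scores), bad, axis=1))
```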
Limitations
- Requires a sufficient sample size — item statistics from a class of 10 are unreliable
- Facility value is group-dependent — the same item has different FV values with different populations
- Classical item analysis cannot separate item difficulty from student ability (for that, see Rasch/IRT models)
- Only applies to objectively scored items — open-ended tasks require different quality checks (e.g., rubric review, rater agreement studies)
Key References
- Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.
- Brown, H. D. & Abeywickrama, P. (2010). Language Assessment: Principles and Classroom Practices (2nd ed.). Pearson.
- Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
- Heaton, J. B. (1988). Writing English Language Tests. Longman.
- Alderson, J. C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge University Press.
See Also
- Achievement Test — item analysis is essential for improving institutional tests
- Reliability — removing weak items improves test reliability
- Construct Validity — item analysis provides evidence for (or against) construct validity
- Validity — item quality underpins overall test validity