Item Analysis

Item analysis is the statistical examination of test items after administration to evaluate their quality. It answers three questions about each item: How difficult was it? Did it distinguish between stronger and weaker students? Were the wrong answer options (distractors) effective? The results guide decisions about which items to keep, revise, or discard — and provide evidence for the overall quality and reliability of the test.

Item analysis is essential for any institution that develops its own tests, including achievement tests and progress tests. Without it, test writers rely on intuition alone, and research consistently shows that intuitive judgements of item difficulty and quality are unreliable.

Core Statistics

Facility Value (Item Difficulty)

The facility value (FV), also called the difficulty index (p), measures how easy or difficult an item is. It is simply the proportion of test-takers who answered correctly.

Formula:

FV = Number of correct responses / Total number of test-takers
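
For illustration, here is the calculation in Python (a minimal sketch; the 0/1 response list is invented for the example):

  # Facility value: proportion of test-takers who answered the item correctly.
  # Hypothetical responses for one item: 1 = correct, 0 = incorrect.
  responses = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

  fv = sum(responses) / len(responses)
  print(f"FV = {fv:.2f}")  # FV = 0.70 -> an easy item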

FV Range     Interpretation                                Action
0.90–1.00    Very easy — almost everyone got it right      Consider removing (no discrimination)
0.70–0.89    Easy                                          Acceptable for early items or confidence building
0.30–0.69    Moderate                                      Ideal range for most test items
0.10–0.29    Difficult                                     Acceptable if the item discriminates well
0.00–0.09    Very difficult — almost no one got it right   Investigate — possible flawed item

A well-constructed test should contain items across the difficulty spectrum, with the majority in the 0.30–0.70 range (Hughes, 2003). The optimal mean difficulty for a norm-referenced test is around 0.50, which maximises the spread of scores.

Discrimination Index

The discrimination index (D) measures how well an item separates high-ability from low-ability test-takers. If strong students get it right and weak students get it wrong, the item discriminates well.

Calculation method (upper-lower method):

  1. Rank all test-takers by total score
  2. Take the top 27% (upper group) and bottom 27% (lower group)
  3. Calculate: D = (proportion correct in upper group) − (proportion correct in lower group)

D Value      Interpretation                                              Action
0.40+        Excellent discrimination                                    Keep
0.30–0.39    Good discrimination                                         Keep
0.20–0.29    Acceptable                                                  Review for possible improvement
0.10–0.19    Poor discrimination                                         Revise
0.00–0.09    No discrimination                                           Discard or completely rewrite
Negative     Reversed discrimination (weak students outperform strong)   Discard — likely a flawed or ambiguous item

A negative discrimination index is a red flag: it typically means the item is ambiguous (strong students overthink it), the key is wrong, or the item tests something unrelated to the construct.
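
The upper-lower calculation is easy to script. A minimal Python sketch, with invented scores and the 27% split from the steps above:

  # Discrimination index by the upper-lower (27%) method.
  # Hypothetical data: each test-taker is a (total_score, item_correct) pair.
  takers = [(95, 1), (90, 1), (88, 1), (85, 1), (80, 0), (76, 1), (70, 0),
            (65, 1), (60, 0), (55, 0), (50, 0), (45, 1), (40, 0), (35, 0)]

  takers.sort(key=lambda t: t[0], reverse=True)  # 1. rank by total score
  n = max(1, round(0.27 * len(takers)))          # 2. size of each extreme group
  upper, lower = takers[:n], takers[-n:]

  p_upper = sum(item for _, item in upper) / n   # proportion correct, upper group
  p_lower = sum(item for _, item in lower) / n   # proportion correct, lower group
  print(f"D = {p_upper - p_lower:.2f}")          # 3. D = 1.00 - 0.25 = 0.75 here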

Point-Biserial Correlation

The point-biserial correlation (r_pb) is a more sophisticated alternative to the discrimination index: it measures the correlation between performance on a single item (correct/incorrect) and the total test score. Values above 0.25 are generally considered acceptable; values above 0.40 are strong.
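
One way to compute it in Python, using the standard point-biserial formula (a sketch; all data below is invented):

  import statistics

  # Point-biserial correlation between one item (0/1) and total test scores.
  # Parallel lists, one entry per test-taker; the numbers are hypothetical.
  item   = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
  totals = [92, 88, 84, 80, 75, 70, 66, 60, 55, 50]

  p = sum(item) / len(item)                                    # proportion correct
  m1 = statistics.mean(t for t, i in zip(totals, item) if i == 1)
  m0 = statistics.mean(t for t, i in zip(totals, item) if i == 0)
  s = statistics.pstdev(totals)                                # population SD
  r_pb = (m1 - m0) / s * (p * (1 - p)) ** 0.5
  print(f"r_pb = {r_pb:.2f}")                                  # ~0.58 here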

Distractor Analysis

For multiple-choice items, distractor analysis examines how each wrong option performed:

Indicator                                                 What it means
Distractor chosen by < 5% of test-takers                  Non-functional — too implausible to attract anyone
Distractor chosen more by upper group than lower group    Malfunctioning — misleading strong students
Distractors roughly equally attractive to lower group     Well-designed — all options are plausible

An ideal MCQ item has one clearly correct answer that the upper group selects, and 2–3 distractors that attract the lower group roughly equally. Non-functional distractors should be replaced with more plausible alternatives.
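
Tabulating option choices by group is enough to spot both problems. A Python sketch (the option labels, groups, and choices are all hypothetical):

  from collections import Counter

  # Distractor analysis for one MCQ item with key 'B' (invented data).
  upper_choices = ["B", "B", "B", "A", "B", "B", "C", "B"]
  lower_choices = ["A", "C", "B", "D", "A", "C", "B", "D"]

  for group, choices in (("upper", upper_choices), ("lower", lower_choices)):
      counts = Counter(choices)
      row = ", ".join(f"{opt}: {counts[opt] / len(choices):.0%}" for opt in "ABCD")
      print(f"{group}: {row}")
  # Look for options below 5% overall (non-functional) and distractors
  # chosen more by the upper group than the lower (malfunctioning).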

Practical Procedure

For a classroom teacher or test development team:

  1. Administer the test to a sufficiently large group (ideally 30+)
  2. Score all papers and rank by total score
  3. Calculate FV and D for each item (spreadsheet or testing software)
  4. Examine distractors for MCQ items
  5. Flag items with extreme FV (< 0.20 or > 0.90), poor D (< 0.20), or negative D (see the sketch after this list)
  6. Decide: keep, revise, or discard each flagged item
  7. Build an item bank of validated items for future tests
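
A sketch of the flagging step (step 5) in Python, assuming FV and D have already been computed per item as above; the item IDs and values are invented:

  # Flag items with extreme FV, poor D, or negative D.
  fv_d = {"Q1": (0.85, 0.35), "Q2": (0.95, 0.05), "Q3": (0.45, 0.42),
          "Q4": (0.15, 0.10), "Q5": (0.60, -0.12)}

  for item_id, (fv, d) in fv_d.items():
      flags = []
      if fv < 0.20 or fv > 0.90:
          flags.append("extreme FV")
      if d < 0:
          flags.append("negative D")
      elif d < 0.20:
          flags.append("poor D")
      print(item_id, "keep" if not flags else "flag: " + ", ".join(flags))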

Item Analysis and Test Quality

Item analysis data contributes to broader test quality:

  • Reliability — Removing poorly discriminating items typically raises the test's internal consistency (Cronbach's alpha; a computational sketch follows this list)
  • Construct Validity — Items with negative discrimination may be testing something other than the intended construct
  • Fairness — Items with unexpected difficulty patterns may indicate cultural or linguistic bias
  • Test length — Analysis may reveal redundant items that can be removed without reducing reliability
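
Cronbach's alpha itself is straightforward to compute from a score matrix. A minimal Python sketch (the 0/1 matrix is invented; rows are test-takers, columns are items):

  import statistics

  scores = [[1, 1, 1, 0, 1],
            [1, 1, 0, 1, 1],
            [1, 0, 1, 0, 0],
            [0, 1, 0, 0, 1],
            [0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1]]

  k = len(scores[0])                 # number of items
  item_vars = [statistics.pvariance([row[j] for row in scores]) for j in range(k)]
  total_var = statistics.pvariance([sum(row) for row in scores])
  alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
  print(f"alpha = {alpha:.2f}")      # ~0.74 for this toy matrix

Dropping an item with near-zero or negative discrimination and recomputing alpha shows the effect directly.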

Limitations

  • Requires a sufficient sample size — item statistics from a class of 10 are unreliable
  • Facility value is group-dependent — the same item has different FV values with different populations
  • Classical item analysis cannot separate item difficulty from student ability (for that, see Rasch/IRT models)
  • Only applies to objectively scored items — open-ended tasks require different quality checks (e.g., rubric review, rater agreement studies)

Key References

  • Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.
  • Brown, H. D. & Abeywickrama, P. (2010). Language Assessment: Principles and Classroom Practices (2nd ed.). Pearson.
  • Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
  • Heaton, J. B. (1988). Writing English Language Tests. Longman.
  • Alderson, J. C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge University Press.
