Item Difficulty
Item difficulty (also called the facility value or p-value) is the proportion of test takers who answer an item correctly. Despite the name "difficulty", a higher value indicates an easier item; the p here stands for proportion, not the p-value of statistical significance testing.
The value ranges from 0 (no one answered correctly) to 1 (everyone answered correctly). It is one of the two core statistics in item analysis, alongside item discrimination.
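Because the p-value is simply a proportion, it can be computed directly from a scored response matrix. The sketch below assumes responses have already been scored dichotomously (1 = correct, 0 = incorrect); the data are illustrative, not from any real administration.

```python
# Minimal sketch: computing p-values from scored responses.
# Each row is one test taker; each column is one item.
# The response matrix is illustrative.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]

n_takers = len(responses)
# p-value for each item = number correct / number of test takers
p_values = [sum(column) / n_takers for column in zip(*responses)]
print(p_values)  # one p-value per item, in column order
```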
Interpreting the P-Value
| P-Value | Interpretation |
|---|---|
| 0.90–1.00 | Very easy — nearly all candidates correct |
| 0.70–0.89 | Easy |
| 0.30–0.69 | Moderate — optimal range for most items |
| 0.10–0.29 | Difficult |
| 0.00–0.09 | Very difficult — investigate for possible flaws |
The ideal difficulty depends on the test's purpose:
- Norm-referenced tests — Items clustered around p = 0.50 maximise score spread and discrimination. A mean difficulty of 0.50 produces the widest distribution of total scores.
- Criterion-referenced tests — Item difficulty should reflect the actual difficulty of the target domain. If 80% of competent users can perform a task, a p-value of 0.80 among competent test takers is appropriate, not a defect.
- Achievement tests — Items may legitimately be easy (p > 0.80) if they test core content that most learners should have acquired.
Relationship to Discrimination
Extremely easy and extremely difficult items cannot discriminate well. If everyone gets an item right (p = 1.0) or everyone gets it wrong (p = 0.0), the item tells us nothing about individual differences. Items in the moderate range (0.30–0.70) have the most room to differentiate between stronger and weaker candidates.
However, an item with p = 0.85 can still have acceptable discrimination if the 15% who got it wrong are consistently from the low-scoring group.
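This can be checked with a simple upper-lower discrimination index (the proportion correct in the top-scoring half minus the proportion correct in the bottom-scoring half). The sketch below uses invented data for 20 candidates in which all three wrong answers come from low scorers, so an item with p = 0.85 still shows positive discrimination.

```python
# Sketch with illustrative data: an easy item can still
# discriminate if its errors cluster in the low-scoring group.
# Each pair is (item correct 0/1, candidate's total test score).
data = [(0, 10), (0, 12), (0, 14),
        (1, 11), (1, 13), (1, 15), (1, 16),
        (1, 20), (1, 22), (1, 24), (1, 25), (1, 26),
        (1, 30), (1, 32), (1, 34), (1, 35), (1, 36),
        (1, 37), (1, 38), (1, 40)]

p = sum(correct for correct, _ in data) / len(data)  # item difficulty

# Upper-lower index: rank candidates by total score, then compare
# the item's p-value in the top half against the bottom half.
ranked = sorted(data, key=lambda pair: pair[1])
half = len(ranked) // 2
p_lower = sum(correct for correct, _ in ranked[:half]) / half
p_upper = sum(correct for correct, _ in ranked[half:]) / half
discrimination = p_upper - p_lower
print(p, discrimination)
```

Here the item is easy (p = 0.85), yet the discrimination index is clearly positive because only low scorers missed it.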
Limitations Within Classical Test Theory
In CTT, item difficulty is sample-dependent — the same item will have different p-values when administered to groups of different ability levels. This is a fundamental limitation: an item is not inherently "difficult" or "easy" in isolation; it is difficult for a particular group.
Item Response Theory (IRT) addresses this limitation by modelling item difficulty as a parameter independent of the sample, but CTT's simplicity makes p-values practical and widely used in classroom and institutional testing contexts.
Practical Guidelines
- Calculate p-values for every item after each test administration
- Flag items with p < 0.20 or p > 0.90 for review
- Check flagged items for ambiguity, cultural bias, or miskeying
- Maintain a range of difficulties across the test — a mix of easy, moderate, and hard items
- Record p-values in an item bank to track item behaviour across administrations
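The flagging step above is straightforward to automate. In this sketch the item names, p-values, and thresholds (p < 0.20 or p > 0.90, as in the guideline) are illustrative.

```python
# Sketch of the flagging guideline: mark items whose p-value
# falls outside 0.20-0.90 for review. Data are illustrative.
item_p = {"Q1": 0.95, "Q2": 0.55, "Q3": 0.12, "Q4": 0.80}

flagged = {item: p for item, p in item_p.items()
           if p < 0.20 or p > 0.90}
print(flagged)  # the very easy and very difficult items
```

Flagged items are candidates for review, not automatic deletion; as noted above, a high p-value may be entirely appropriate on a criterion-referenced test.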
See Also
- Item Analysis — the broader framework that includes difficulty, discrimination, and distractor analysis
- Item Discrimination — how well an item separates strong from weak candidates
- Classical Test Theory — the measurement framework underpinning p-values