Item Difficulty
Item difficulty (also called the facility value or p-value) is the proportion of test takers who answer an item correctly. Despite the name "difficulty", a higher value indicates an easier item; the p here stands for proportion, not the p-value of statistical significance testing.
The value ranges from 0 (no one answered correctly) to 1 (everyone answered correctly). It is one of the two core statistics in item analysis, alongside item discrimination.
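Because the p-value is simply a proportion, it can be computed directly from a scored response matrix. The sketch below assumes responses have already been scored dichotomously (1 = correct, 0 = incorrect); the data are illustrative, not from any real administration.

```python
# Minimal sketch: computing p-values from scored responses.
# Each row is one test taker; each column is one item.
# The response matrix is illustrative.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]

n_takers = len(responses)
# p-value for each item = number correct / number of test takers
p_values = [sum(column) / n_takers for column in zip(*responses)]
print(p_values)  # one p-value per item, in column order
```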
Interpreting the P-Value
| P-Value | Interpretation |
|---|---|
| 0.90–1.00 | Very easy — nearly all candidates correct |
| 0.70–0.89 | Easy |
| 0.30–0.69 | Moderate — optimal range for most items |
| 0.10–0.29 | Difficult |
| 0.00–0.09 | Very difficult — investigate for possible flaws |
The ideal difficulty depends on the test's purpose:
- Norm-referenced tests — Items clustered around p = 0.50 maximise score spread and discrimination. A mean difficulty of 0.50 produces the widest distribution of total scores.
- Criterion-referenced tests — Item difficulty should reflect the actual difficulty of the target domain. If 80% of competent users can perform a task, a p-value of 0.80 among competent test takers is appropriate, not a defect.
- Achievement tests — Items may legitimately be easy (p > 0.80) if they test core content that most learners should have acquired.
Relationship to Discrimination
Extremely easy and extremely difficult items cannot discriminate well. If everyone gets an item right (p = 1.0) or everyone gets it wrong (p = 0.0), the item tells us nothing about individual differences. Items in the moderate range (0.30–0.70) have the most room to differentiate between stronger and weaker candidates.
However, an item with p = 0.85 can still have acceptable discrimination if the 15% who got it wrong are consistently from the low-scoring group.
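This can be checked with a simple upper-lower discrimination index (the proportion correct in the top-scoring half minus the proportion correct in the bottom-scoring half). The sketch below uses invented data for 20 candidates in which all three wrong answers come from low scorers, so an item with p = 0.85 still shows positive discrimination.

```python
# Sketch with illustrative data: an easy item can still
# discriminate if its errors cluster in the low-scoring group.
# Each pair is (item correct 0/1, candidate's total test score).
data = [(0, 10), (0, 12), (0, 14),
        (1, 11), (1, 13), (1, 15), (1, 16),
        (1, 20), (1, 22), (1, 24), (1, 25), (1, 26),
        (1, 30), (1, 32), (1, 34), (1, 35), (1, 36),
        (1, 37), (1, 38), (1, 40)]

p = sum(correct for correct, _ in data) / len(data)  # item difficulty

# Upper-lower index: rank candidates by total score, then compare
# the item's p-value in the top half against the bottom half.
ranked = sorted(data, key=lambda pair: pair[1])
half = len(ranked) // 2
p_lower = sum(correct for correct, _ in ranked[:half]) / half
p_upper = sum(correct for correct, _ in ranked[half:]) / half
discrimination = p_upper - p_lower
print(p, discrimination)
```

Here the item is easy (p = 0.85), yet the discrimination index is clearly positive because only low scorers missed it.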
Limitations Within Classical Test Theory
In CTT, item difficulty is sample-dependent — the same item will have different p-values when administered to groups of different ability levels. This is a fundamental limitation: an item is not inherently "difficult" or "easy" in isolation; it is difficult for a particular group.
Item Response Theory (IRT) addresses this limitation by modelling item difficulty as a parameter independent of the sample, but CTT's simplicity makes p-values practical and widely used in classroom and institutional testing contexts.
Practical Guidelines
- Calculate p-values for every item after each test administration
- Flag items with p < 0.20 or p > 0.90 for review
- Check flagged items for ambiguity, cultural bias, or miskeying
- Maintain a range of difficulties across the test — a mix of easy, moderate, and hard items
- Record p-values in an item bank to track item behaviour across administrations
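The flagging step above is straightforward to automate. In this sketch the item names, p-values, and thresholds (p < 0.20 or p > 0.90, as in the guideline) are illustrative.

```python
# Sketch of the flagging guideline: mark items whose p-value
# falls outside 0.20-0.90 for review. Data are illustrative.
item_p = {"Q1": 0.95, "Q2": 0.55, "Q3": 0.12, "Q4": 0.80}

flagged = {item: p for item, p in item_p.items()
           if p < 0.20 or p > 0.90}
print(flagged)  # the very easy and very difficult items
```

Flagged items are candidates for review, not automatic deletion; as noted above, a high p-value may be entirely appropriate on a criterion-referenced test.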
See Also
- Item Analysis — the broader framework that includes difficulty, discrimination, and distractor analysis
- Item Discrimination — how well an item separates strong from weak candidates
- Classical Test Theory — the measurement framework underpinning p-values