Classical Test Theory
Classical Test Theory (CTT) is the traditional framework for understanding measurement in testing. Its central equation is deceptively simple:
$$X = T + E$$
where X is the observed score, T is the true score (the hypothetical error-free score), and E is the error of measurement. Every observed score contains some degree of random error — CTT provides the tools to estimate how much.
CTT has dominated language testing and educational measurement since the early 20th century. While modern approaches like Item Response Theory (IRT) address some of its limitations, CTT remains the practical foundation for most classroom and institutional testing because of its computational simplicity and modest sample-size requirements.
Core Assumptions
- The true score is the expected value of the observed score across an infinite number of independent administrations of the same test
- Error scores are random — they have an expected value of zero
- Errors are uncorrelated with true scores — knowing someone's true ability tells you nothing about their error
- Errors across tests are uncorrelated — error on one test is unrelated to error on another
These assumptions are untestable in practice (we never observe true scores), but they provide a workable framework for estimating reliability and measurement error.
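Although the assumptions cannot be checked on real data, their consequences can be illustrated with simulated data in which the true scores are known. The following is a minimal sketch, not a standard procedure: the function names, the normal distributions, and the chosen means and standard deviations are all assumptions made for illustration.

```python
import random
import statistics


def simulate_ctt(n_people=20000, true_sd=8.0, error_sd=4.0, seed=0):
    """Simulate X = T + E with errors drawn independently of true scores.

    All parameter values are illustrative assumptions, not empirical estimates.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_people)]  # mean-zero error
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


def pearson_r(xs, ys):
    """Pearson correlation, written out by hand for transparency."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / (ss_x * ss_y) ** 0.5


true_scores, errors, observed = simulate_ctt()

mean_error = statistics.fmean(errors)           # expected to be close to 0
r_true_error = pearson_r(true_scores, errors)   # expected to be close to 0
var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(errors)
var_observed = statistics.pvariance(observed)   # close to var_true + var_error

print(f"mean error:           {mean_error:.3f}")
print(f"corr(true, error):    {r_true_error:.3f}")
print(f"var(X):               {var_observed:.2f}")
print(f"var(T) + var(E):      {var_true + var_error:.2f}")
```

Because the errors are generated independently of the true scores, the observed-score variance splits almost exactly into true-score variance plus error variance. That additive split is what makes a variance-ratio definition of reliability possible.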
Key Concepts in CTT
Reliability
Reliability in CTT is defined as the proportion of observed score variance attributable to true score variance:
$$\text{reliability} = \frac{\sigma^2_T}{\sigma^2_X}$$
A reliability coefficient of 0.85 means 85% of the variance in observed scores reflects real differences between test takers; 15% is error. Common estimation methods include:
| Method | What it estimates |
|---|---|
| Test-retest | Stability over time |
| Parallel forms | Equivalence across test versions |
| Split-half | Internal consistency (single administration) |
| Cronbach's alpha | Internal consistency — the most widely reported |
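As an illustration of the most widely reported method, Cronbach's alpha can be computed from a single administration using the usual formula, alpha = k / (k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The sketch below assumes a people-by-items matrix of item scores. The function name and the five-person, four-item data set are hypothetical.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a people-by-items matrix of item scores.

    Each row is one test taker; each column is one item.
    """
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")

    items = list(zip(*responses))  # transpose rows into per-item columns
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have no variance")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical dichotomous responses (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # 0.8
```

Sample variances are used for both the items and the total. Population variances would give the same alpha, because the correction factor cancels in the ratio.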
Standard Error of Measurement (SEM)
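The SEM estimates how much an individual's observed score might vary due to measurement error:
$$\text{SEM} = SD \times \sqrt{1 - \text{reliability}}$$
where SD is the standard deviation of the observed scores.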
A test with SD = 10 and reliability = 0.84 has SEM = 10 × √(1 − 0.84) = 4. For a candidate scoring 65, the band of ±1 SEM runs from 61 to 69, which covers the true score about 68% of the time if errors are normally distributed; ±2 SEM (57 to 73) gives roughly 95%. This has direct implications for cut score decisions: candidates near the boundary may be misclassified by chance.
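The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch: the function names and the cut scores of 68 and 70 are hypothetical, chosen only to show how a score band can flag borderline decisions.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed_score, sem, n_sems=1.0):
    """Return the (low, high) band of +/- n_sems SEM around an observed score."""
    return observed_score - n_sems * sem, observed_score + n_sems * sem


def band_contains_cut(observed_score, cut_score, sem, n_sems=1.0):
    """True when the cut score falls inside the candidate's score band."""
    low, high = score_band(observed_score, sem, n_sems)
    return low <= cut_score <= high


sem = standard_error_of_measurement(sd=10, reliability=0.84)
low, high = score_band(65, sem)
print(round(sem, 2), round(low, 2), round(high, 2))

# Hypothetical cut scores: 68 lies inside the band for a score of 65, 70 does not.
print(band_contains_cut(65, 68, sem))
print(band_contains_cut(65, 70, sem))
```

A candidate whose band contains the cut score is one for whom a pass or fail decision could plausibly have gone the other way on a different occasion.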
Item Statistics
CTT provides the two core item-level statistics used in item analysis:
- Item Difficulty (p-value) — proportion answering correctly
- Item Discrimination (D or point-biserial) — how well the item separates strong from weak candidates
Both are straightforward to calculate and interpret, which is why CTT-based item analysis is standard practice even in contexts where IRT would be theoretically preferable.
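Both statistics can be computed directly from a scored response matrix. The sketch below uses a hypothetical five-person, four-item data set and hypothetical function names. It computes the uncorrected point-biserial, meaning the item is included in the total score. Many item analysis tools instead report a corrected version that excludes the item from the total.

```python
import statistics


def item_difficulty(item_scores):
    """Proportion of test takers answering a 0/1-scored item correctly."""
    return statistics.fmean(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and the total score.

    Returns None when either variable has no variance (for example,
    an item that everyone answered correctly).
    """
    mean_item = statistics.fmean(item_scores)
    mean_total = statistics.fmean(total_scores)
    cross = sum((i - mean_item) * (t - mean_total)
                for i, t in zip(item_scores, total_scores))
    ss_item = sum((i - mean_item) ** 2 for i in item_scores)
    ss_total = sum((t - mean_total) ** 2 for t in total_scores)
    if ss_item == 0 or ss_total == 0:
        return None
    return cross / (ss_item * ss_total) ** 0.5


# Hypothetical dichotomous responses: rows are test takers, columns are items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
items = list(zip(*responses))

difficulties = [item_difficulty(item) for item in items]
discriminations = [point_biserial(item, totals) for item in items]

for number, (p, r_pb) in enumerate(zip(difficulties, discriminations), start=1):
    print(f"item {number}: p = {p:.2f}, point-biserial = {r_pb:.2f}")
```

With so few test takers these values are purely illustrative. In practice both statistics are unstable in small samples and should be interpreted alongside the size and ability range of the group tested.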
Limitations
| Limitation | Explanation |
|---|---|
| Sample dependence | Item statistics (difficulty, discrimination) depend on the sample tested: the same item appears easier when administered to a more able group |
| Test dependence | Person ability estimates depend on which items were administered |
| Equal error assumption | The standard CTT SEM is a single value applied to every test taker; in reality, precision varies with ability, and tests typically measure most precisely near the middle of the score range |
| Total score focus | CTT works with total scores; it cannot model the probability of a specific response to a specific item |
IRT addresses these limitations by modelling the relationship between item parameters and person ability on a common scale. When the model fits the data, the resulting estimates do not depend on the particular sample or test form. However, IRT requires larger samples (often cited as 200 or more, with more complex models needing considerably more) and specialised software.
CTT vs IRT in Practice
For most language teaching contexts — classroom tests, progress tests, placement tests, institutional achievement exams — CTT is sufficient and practical. IRT is essential for large-scale, high-stakes testing programmes (IELTS, TOEFL, Cambridge exams) where test equating, adaptive testing, and item banking across forms are required.
Key References
- Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.
- Brown, J. D. (2005). Testing in Language Programs. McGraw-Hill.
- Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. Holt, Rinehart and Winston.
See Also
- Reliability — the central quality index in CTT
- Item Analysis — applied CTT at the item level
- Item Difficulty — the p-value, a CTT item statistic
- Item Discrimination — D and point-biserial, CTT discrimination measures