Generalizability Theory
Generalizability theory (G-theory) is a framework for analysing measurement error in which multiple sources — raters, tasks, occasions, criteria — are estimated jointly through analysis-of-variance procedures rather than collapsed into a single reliability coefficient. The framework was set out by Cronbach, Gleser, Nanda, and Rajaratnam in their 1972 monograph The Dependability of Behavioral Measurements, building on a series of papers from the 1960s.
Logic
Classical test theory partitions an observed score into true and error components but does not differentiate among error sources. G-theory replaces the single error term with variance components attributable to each facet of the measurement design and to interactions between facets. A generalizability study (G-study) estimates these components from a fully crossed or partially nested design, typically persons by raters by tasks. A subsequent decision study (D-study) combines those components to project the dependability of scores under alternative measurement conditions, for example with more raters, with fewer tasks, or on a single occasion rather than several.
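The G-study step can be illustrated with a minimal sketch for the simplest case, a fully crossed persons-by-raters design with one score per cell. This is a simplification of the persons-by-raters-by-tasks design mentioned above; the function name g_study_pxr and the dictionary keys are hypothetical, and the estimates come from the usual expected-mean-square equations for a random-effects two-way layout without replication.

```python
def g_study_pxr(scores):
    """Estimate variance components for a crossed persons x raters design.

    scores: list of rows, one row per person, one column per rater,
    exactly one score per cell. Hypothetical helper for illustration only.
    """
    n_p = len(scores)
    n_r = len(scores[0])
    if n_p < 2 or n_r < 2:
        raise ValueError("need at least two persons and two raters")

    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Sums of squares for persons, raters, and the residual.
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-square equations for each component.
    # Negative estimates are truncated at zero, a common convention.
    return {
        "p": max((ms_p - ms_res) / n_r, 0.0),
        "r": max((ms_r - ms_res) / n_p, 0.0),
        # Person-by-rater interaction, confounded with random error.
        "pr,e": max(ms_res, 0.0),
    }
```

With only one observation per cell, the person-by-rater interaction cannot be separated from residual error, which is why the sketch reports them as a single component.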
Coefficients
Two summary coefficients are commonly reported. The generalizability coefficient (Eρ²) parallels reliability for relative decisions, such as ranking examinees against one another. The dependability index (Φ) handles absolute decisions, such as pass/fail against a fixed cut-score. Φ can be no higher than Eρ², and is usually lower, because absolute error variance also includes sources of error, such as overall rater severity or task difficulty, that do not affect ranking. Both range from 0 to 1.
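For a fully crossed random-effects persons-by-raters-by-tasks design, the two coefficients can be written out as follows. The notation is an assumption made here for illustration: σ² with a subscript denotes the variance component for that effect, and n'_r and n'_t denote the numbers of raters and tasks in the D-study.

```latex
\begin{align}
\sigma^2_\delta &= \frac{\sigma^2_{pr}}{n'_r} + \frac{\sigma^2_{pt}}{n'_t}
                 + \frac{\sigma^2_{prt,e}}{n'_r n'_t} \\
\sigma^2_\Delta &= \sigma^2_\delta + \frac{\sigma^2_r}{n'_r}
                 + \frac{\sigma^2_t}{n'_t} + \frac{\sigma^2_{rt}}{n'_r n'_t} \\
E\rho^2 &= \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}, \qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}
\end{align}
```

Relative error variance contains only the interactions with persons, whereas absolute error variance adds the rater, task, and rater-by-task components. The two coefficients coincide only when those added components are all zero.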
Use in language testing
G-theory is widely used for analysing performance assessments in which raters and tasks both contribute to score variance. Studies of TOEFL iBT speaking, IELTS Writing, and Cambridge English speaking modules have used G-theory to determine, for example, how many tasks and how many raters are needed to reach a target dependability. On the computational side, Brennan's Generalizability Theory (2001) provides the canonical treatment, and the GENOVA, urGENOVA, and mGENOVA programs implement the analyses. G-theory and the many-facet Rasch model are complementary: the former emphasises variance decomposition across facets, the latter estimates for individual raters and tasks.
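The question of how many raters and tasks are needed to reach a target dependability can be sketched as a small search over candidate D-study designs. The sketch below assumes a fully crossed random-effects persons-by-raters-by-tasks design; the function names, the dictionary keys, the search limits, and the variance components in the usage example are all hypothetical and are not taken from any of the studies mentioned above.

```python
from itertools import product


def phi_index(var, n_r, n_t):
    """Dependability index for a crossed p x r x t random design.

    var maps effect labels to variance components (hypothetical keys).
    """
    abs_error = (
        var["r"] / n_r
        + var["t"] / n_t
        + var["pr"] / n_r
        + var["pt"] / n_t
        + (var["rt"] + var["prt,e"]) / (n_r * n_t)
    )
    return var["p"] / (var["p"] + abs_error)


def smallest_design(var, target, max_raters=6, max_tasks=10):
    """Find the design with the fewest total ratings that meets the target.

    Returns (n_r, n_t, phi), or None if no design within the limits
    reaches the target. Ties are broken in favour of fewer raters.
    """
    candidates = [
        (n_r * n_t, n_r, n_t)
        for n_r, n_t in product(range(1, max_raters + 1),
                                range(1, max_tasks + 1))
        if phi_index(var, n_r, n_t) >= target
    ]
    if not candidates:
        return None
    _, n_r, n_t = min(candidates)
    return n_r, n_t, phi_index(var, n_r, n_t)


if __name__ == "__main__":
    # Made-up variance components, for illustration only.
    example = {"p": 0.50, "r": 0.02, "t": 0.05, "pr": 0.05,
               "pt": 0.15, "rt": 0.01, "prt,e": 0.20}
    print(smallest_design(example, target=0.80))
```

Minimising the total number of ratings is only one possible criterion. In practice, raters and tasks usually carry different costs, so the objective would be weighted accordingly.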
References
- Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. Wiley.
- Brennan, R. L. (2001). Generalizability Theory. Springer.
- Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.