Type-Token Ratio
The number of unique word forms (types) in a text divided by the total number of running words (tokens). A text of 100 tokens with 60 distinct word forms has a TTR of 0.60. The simplest and oldest operationalisation of Lexical Diversity, introduced in stylometric work in the early twentieth century and still the default in many entry-level analysis tools.
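The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `tokenize` helper and its regular expression are assumptions made for the example, and real tools make different tokenisation choices.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only).

    Assumption for this sketch: punctuation and digits are simply dropped.
    """
    return re.findall(r"[a-z']+", text.lower())

def ttr(tokens):
    """Type-token ratio: number of distinct word forms divided by number of running words."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty text")
    return len(set(tokens)) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)

# 13 running words, 8 distinct word forms
print(len(tokens), len(set(tokens)), round(ttr(tokens), 3))  # 13 8 0.615
```

The sample sentence has 13 tokens and 8 types, so its TTR is 8/13, roughly 0.615.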
The length-sensitivity problem
TTR tends to fall as text length grows. Early in a text almost every token is a new type, but new types become progressively rarer: additional tokens are increasingly repetitions of function words (the, of, and) and previously introduced content words. The numerator grows ever more slowly while the denominator rises by one with every token, so the ratio decays. Two texts by the same author at different lengths will generally produce different TTRs even if their underlying vocabulary range is identical.
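The decay can be made visible by computing TTR on growing prefixes of one text. The sketch below uses a deliberately extreme toy text, one eight-word sentence repeated, so that the vocabulary range is fixed by construction. The function name `running_ttr` and the checkpoint values are assumptions for illustration. Real texts decay more slowly because new types keep appearing.

```python
def running_ttr(tokens, checkpoints):
    """Return (n, TTR of the first n tokens) for each checkpoint n that fits in the text."""
    results = []
    for n in checkpoints:
        if 0 < n <= len(tokens):
            prefix = tokens[:n]
            results.append((n, len(set(prefix)) / n))
    return results

# Toy text: the same 8-token sentence repeated, so the vocabulary never changes.
sentence = "the quick fox jumps over the lazy dog".split()
tokens = sentence * 50  # 400 tokens, 7 distinct types

for n, value in running_ttr(tokens, [8, 40, 100, 250, 400]):
    print(f"{n:>4} tokens  TTR = {value:.4f}")
```

The vocabulary is identical at every checkpoint, yet TTR drops from 0.875 at 8 tokens to below 0.02 at 400 tokens, purely as a function of length.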
The consequence is that TTR is only meaningful for comparisons across texts of the same length. It cannot fairly rank a 250-word essay against a 400-word essay, and it cannot track a learner's diversity growth across writings of different lengths without correction.
What TTR is and is not
TTR is a useful within-corpus reporting statistic when text lengths are tightly controlled — fixed-length essay tasks, normalised speech samples, or windowed segments of larger texts. It is also pedagogically useful as a first concept when introducing learners or trainee teachers to the type/token distinction.
TTR is not a valid measure of lexical diversity across texts of different lengths. The widely used modern lexical-diversity indices (MTLD, vocd-D, HD-D, MATTR) were all developed largely because TTR fails the length test. Treating raw TTR as a learner ability indicator across heterogeneous samples is a category error common enough to have its own warning in the McCarthy and Jarvis (2010) validation paper.
The closest length-stable cousin is the moving-average TTR (MATTR), which slides a fixed-size window across the text and averages the within-window TTRs. MATTR keeps TTR's interpretability while neutralising most of the length effect.
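The windowing idea can be sketched as follows. This is a naive, readable version rather than an optimised one. The function name `mattr` and the default window of 50 tokens are assumptions for the example. The window size is a free parameter that must be reported alongside any score.

```python
def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    # Naive O(n * window) computation, kept simple for readability.
    scores = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(scores) / len(scores)

# Same toy sentence at two very different lengths.
sentence = "the quick fox jumps over the lazy dog".split()
short_text = sentence * 5    # 40 tokens
long_text = sentence * 50    # 400 tokens

print(mattr(short_text, window=8), mattr(long_text, window=8))  # 0.875 0.875
```

Both toy texts score 0.875 because every 8-token window contains the same 7 types, whereas their raw TTRs differ by a factor of ten (7/40 against 7/400).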
Lemmatisation and counting choices
A note on types: implementations differ on whether run, runs, running, and ran count as one type or four. Lemmatised TTR collapses inflectional forms; surface TTR keeps them distinct. The two produce systematically different scores on the same text and are not interchangeable, which is one of the variance sources catalogued in Text Metric Implementation Variance. Tools that compute TTR should state whether they lemmatise.
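The gap between the two counting choices shows up even on a single sentence. In the sketch below, `TOY_LEMMAS` is a hypothetical hand-written lookup table standing in for a real lemmatiser, and the function names are likewise assumptions for illustration.

```python
# Hypothetical lemma table for illustration only; a real pipeline would use
# a proper lemmatiser rather than a hand-written lookup like this.
TOY_LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def surface_ttr(tokens):
    """TTR over word forms exactly as they appear in the text."""
    return len(set(tokens)) / len(tokens)

def lemmatised_ttr(tokens, lemmas=TOY_LEMMAS):
    """TTR after mapping each token to its lemma (unknown tokens are left unchanged)."""
    lemmatised = [lemmas.get(tok, tok) for tok in tokens]
    return len(set(lemmatised)) / len(lemmatised)

tokens = "she runs he ran they run we are running".split()

print(round(surface_ttr(tokens), 3))     # 1.0   (9 types over 9 tokens)
print(round(lemmatised_ttr(tokens), 3))  # 0.667 (6 types over 9 tokens)
```

The same nine tokens give a TTR of 1.0 on surface forms and about 0.667 after lemmatisation, so scores from tools that differ on this choice cannot be compared directly.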
Key References
- Templin, M. C. (1957). Certain Language Skills in Children: Their Development and Interrelationships. University of Minnesota Press. (Early TTR application in child language research.)
- Richards, B. (1987). Type/token ratios: What do they really tell us? Journal of Child Language, 14(2), 201–209.
- Tweedie, F. J. & Baayen, R. H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352.
- McCarthy, P. M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
See Also
- Lexical Diversity: the umbrella construct TTR was the first attempt to measure
- MATTR: the length-stable repair of TTR
- MTLD: a different repair, sequence-based instead of window-averaged
- Text Metric Implementation Variance: lemmatisation and tokenisation choices that move TTR scores between tools