HD-D
A direct, deterministic measure of Lexical Diversity introduced by McCarthy and Jarvis (2007) and validated in McCarthy and Jarvis (2010). HD-D — hypergeometric distribution of diversity — computes analytically what vocd-D approximates by random sampling, so it returns the same score on every run of the same tool against the same text.
How it works
For each lexical type t in a text of N tokens, HD-D computes the probability of encountering at least one token of t in a random draw of n tokens taken without replacement (canonically n = 42, the midpoint of the 35–50 sample range that vocd-D uses). The probability comes from the hypergeometric distribution:
$$P_t = 1 - \frac{\binom{N - f_t}{n}}{\binom{N}{n}}$$
where f_t is the count of type t in the text and the fraction is the probability that the draw contains no token of t. Each type's contribution to HD-D is P_t × (1/n), and the score is the sum across all types, which equals the expected type-token ratio of a random n-token sample. The output is a number between 0 and 1: high means the text is diverse enough that almost any random 42-token sample contains a wide range of types; low means random samples tend to be dominated by a small repeated vocabulary.
The 42-token draw size is not magic. It is the midpoint of the range Malvern and Richards (2002) used in vocd-D, chosen for backward comparability. Other draw sizes give different absolute scores on the same construct.
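The calculation described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name hdd, the draw_size parameter, and the toy sentence are all assumptions made for the example, and the input is assumed to be an already tokenized list.

```python
from collections import Counter
from math import comb


def hdd(tokens, draw_size=42):
    """HD-D: expected type-token ratio of a random draw_size-token sample."""
    n_tokens = len(tokens)
    if n_tokens < draw_size:
        raise ValueError("text must contain at least draw_size tokens")
    all_draws = comb(n_tokens, draw_size)  # C(N, n)
    score = 0.0
    for freq in Counter(tokens).values():
        # Probability that a type with count f_t is absent from the draw:
        # C(N - f_t, n) / C(N, n). math.comb returns 0 when N - f_t < n.
        p_absent = comb(n_tokens - freq, draw_size) / all_draws
        score += (1.0 - p_absent) / draw_size
    return score


# Toy input (hypothetical): 65 tokens, 8 types.
tokens = ("the cat sat on the mat and the dog sat on the log " * 5).split()
for n in (35, 42, 50):
    print(n, round(hdd(tokens, n), 4))
```

Running the loop on the toy text prints a different score for each draw size, which is why scores are only comparable when the draw size is held fixed. Exact integer binomials from math.comb are used here to avoid any approximation in the probabilities; on very long texts a log-space formulation may be faster.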
What HD-D fixes about vocd-D
vocd-D draws hundreds of random samples and fits a theoretical curve to the resulting empirical TTR-vs-length cloud. The random-sampling step introduces stochastic noise: rerun vocd-D on the same text and the score moves by a few points, more on short texts. HD-D computes the hypergeometric probabilities directly from the type frequencies, with no random sampling at all. The score is identical across runs, and across implementations that tokenize the text the same way and use the same draw size it agrees up to floating-point rounding.
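The contrast between sampling and direct computation can be illustrated with a small sketch. The sampled_ttr function below is a hypothetical, heavily simplified stand-in for the sampling step only: it averages the TTR of random 42-token draws and does not perform the curve fitting that vocd-D applies across sample sizes. The function names, the seed values, and the synthetic token list are assumptions for the example.

```python
import random
from collections import Counter
from math import comb


def hdd(tokens, draw_size=42):
    """Analytic expected TTR of a random draw (no sampling involved)."""
    n_tokens = len(tokens)
    all_draws = comb(n_tokens, draw_size)
    return sum(
        (1.0 - comb(n_tokens - freq, draw_size) / all_draws) / draw_size
        for freq in Counter(tokens).values()
    )


def sampled_ttr(tokens, draw_size=42, trials=100, seed=None):
    """Mean TTR over random draws without replacement (stochastic)."""
    rng = random.Random(seed)
    ttrs = [
        len(set(rng.sample(tokens, draw_size))) / draw_size
        for _ in range(trials)
    ]
    return sum(ttrs) / trials


# Synthetic text (hypothetical): 30 rare types plus one dominant type.
tokens = [f"w{i % 30}" for i in range(60)] + ["the"] * 60

estimates = [sampled_ttr(tokens, seed=s) for s in range(5)]
exact = [hdd(tokens) for _ in range(5)]
print("sampled:", [round(e, 4) for e in estimates])
print("analytic:", [round(e, 4) for e in exact])
```

The sampled estimates scatter around the analytic value and change with the seed, while the analytic value is the same on every call.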
McCarthy and Jarvis (2010) report that HD-D and vocd-D correlate at r > 0.97 across their validation corpus — they measure the same thing — and conclude that HD-D should replace vocd-D in research where reproducibility matters. The two indices remain side by side in tools mainly because the older child-language and learner-corpus literature reports vocd-D.
Use alongside MTLD
The validation paper's headline recommendation is to report HD-D and MTLD together. The two indices capture lexical diversity through different mathematical lenses (probability of encountering types in random samples for HD-D, mean length of TTR-stable spans for MTLD), and convergent scores from both are more diagnostic than either alone. Disagreement between the two flags texts where the diversity construct itself is unstable: very short texts, texts with bursty topical vocabulary, or texts where a few high-frequency types dominate.
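For readers who want to see the second lens concretely, the following is a rough sketch of MTLD as it is commonly described: count how many times the running TTR drops to a threshold, add a partial factor for the leftover span, divide the token count by the number of factors, and average a forward and a backward pass. The threshold of 0.72 is the value usually attributed to McCarthy and Jarvis (2010); the function names, the use of <= rather than < at the threshold, and the handling of texts that never reach the threshold are assumptions of this sketch, and published tools may differ on those details.

```python
def _mtld_pass(tokens, threshold):
    """One directional pass: token count divided by number of factors."""
    factors = 0.0
    types = set()
    span_len = 0
    for tok in tokens:
        span_len += 1
        types.add(tok)
        if len(types) / span_len <= threshold:
            # The running TTR has fallen to the threshold: close this span.
            factors += 1
            types.clear()
            span_len = 0
    if span_len:
        # Partial factor for the leftover span, scaled by how far its
        # TTR has fallen from 1 toward the threshold.
        ttr = len(types) / span_len
        factors += (1 - ttr) / (1 - threshold)
    # Assumption: a text whose TTR never drops yields an infinite score.
    return len(tokens) / factors if factors else float("inf")


def mtld(tokens, threshold=0.72):
    """Mean of the forward and backward passes."""
    forward = _mtld_pass(tokens, threshold)
    backward = _mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2


tokens = ("the cat sat on the mat and the dog sat on the log " * 5).split()
print(round(mtld(tokens), 2))
```

Because MTLD depends on token order and HD-D does not, shuffling a text changes the MTLD value while leaving HD-D untouched, which is one way the two indices can come apart on texts with bursty vocabulary.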
Typical ranges
HD-D scores cluster between roughly 0.75 and 0.92 for academic prose. Native-speaker writing typically scores 0.85–0.90; learner writing at B1–B2 scores 0.78–0.85; very simple controlled language scores below 0.75. The 0–1 range makes HD-D less intuitive than MTLD for non-specialists, but it has a direct reading: the score is the expected type-token ratio of a random 42-token sample from the text.
Key References
- McCarthy, P. M. & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4), 459–488.
- McCarthy, P. M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
See Also
- vocd-D: the random-sampling predecessor HD-D supersedes
- MTLD: the sequence-based index McCarthy and Jarvis recommend reporting alongside HD-D
- Lexical Diversity: the umbrella construct both indices target