Text Metric Implementation Variance
The same text scored with the same formula by two different tools will usually give two different numbers. This is not a bug in either tool. It is a structural property of how text metrics are defined: the published formulas operate on inputs (token counts, sentence counts, syllable counts, complex-word counts, lemmas) that the formulas themselves do not specify how to derive. Each tool fills those gaps with its own heuristics, and the heuristics drift.
Treating any single tool's output as ground truth mistakes one measurement for the definitive measurement. There is no canonical implementation of FKGL, Gunning Fog, or MTLD. Web tools, R packages, Python libraries, and academic reference implementations all disagree by characteristic margins on identical text.
Where the variance comes from
Four mechanisms account for nearly all the disagreement between tools.
Tokenization. What counts as a word? Hyphenated compounds (one token or two), contractions (don't → one or two), numbers, URLs, and punctuation-attached forms each have defensible answers. Different tokenizers produce token counts that differ by 1–3% on prose, more on technical text.
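A minimal sketch of the tokenization point, assuming three illustrative regex tokenizers (none of them is taken from a specific tool). It shows how the same sentence yields different word counts depending on how hyphens and apostrophes are treated.

```python
import re

def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

def tokenize_alpha(text):
    """Keep only runs of letters, so hyphens and apostrophes split words."""
    return re.findall(r"[A-Za-z]+", text)

def tokenize_word_internal(text):
    """Keep hyphens and apostrophes inside a word as part of one token."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

sample = "Don't over-think state-of-the-art tools."
counts = {
    "whitespace": len(tokenize_whitespace(sample)),
    "alpha": len(tokenize_alpha(sample)),
    "word_internal": len(tokenize_word_internal(sample)),
}
print(counts)  # {'whitespace': 4, 'alpha': 9, 'word_internal': 4}
```

The sample sentence is deliberately dense with hyphens and contractions, so the spread here is far larger than the 1–3% seen on ordinary prose.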
Sentence segmentation. Periods inside abbreviations (U.S., Dr.), semicolons, colons, and bullet lists each force a heuristic call. Sentence count enters every readability formula as a denominator, so even small differences shift the score.
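The segmentation effect can be sketched the same way. The two splitters below are illustrative assumptions, and the abbreviation list is a small hand-written sample, not a complete resource.

```python
import re

# Illustrative abbreviation list; real tools use much longer ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.", "e.g.", "i.e."}

def split_naive(text):
    """Treat every ., ! or ? followed by whitespace as a sentence boundary."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_abbrev_aware(text):
    """Like split_naive, but rejoin pieces that end in a known abbreviation."""
    sentences, buffer = [], ""
    for piece in split_naive(text):
        buffer = f"{buffer} {piece}".strip()
        last_word = buffer.split()[-1].lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = "Dr. Smith moved to the U.S. in 1990. She teaches there."
naive_count = len(split_naive(sample))          # 4
aware_count = len(split_abbrev_aware(sample))   # 2
```

Halving the sentence count doubles the words-per-sentence term in every readability formula. The abbreviation-aware version has its own failure mode: it will wrongly merge two sentences when the first one genuinely ends in "U.S.".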
Syllable estimation. The Carnegie Mellon Pronouncing Dictionary (CMUdict) covers ~134k forms and is the usual reference for English syllable counts. Tools that rely on a CMU lookup (e.g. Text Inspector) tend to read slightly higher on grade-level metrics because CMU counts more syllables in dense academic vocabulary than letter-pattern heuristics do. Purely heuristic syllable counters (typical of Python implementations built on pyphen hyphenation patterns or vowel-cluster rules) read systematically lower by a small margin. This single difference accounts for most of the FKGL and Gunning Fog drift between tools.
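A sketch of why the two approaches diverge, under stated assumptions: `DICTIONARY` is a hand-entered table standing in for a pronouncing-dictionary lookup (it is not CMUdict data), and `syllables_heuristic` is one simple vowel-cluster rule among many possible ones. The FKGL formula is the standard Kincaid et al. form.

```python
import re

# Hand-entered syllable counts for illustration only; a real tool would
# load a pronouncing dictionary instead of this hypothetical table.
DICTIONARY = {"idea": 3, "science": 2, "variation": 4, "area": 3, "create": 2}

def syllables_heuristic(word):
    """Count vowel groups, dropping one for a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def syllables_lookup(word):
    """Use the dictionary when the word is listed, else the heuristic."""
    return DICTIONARY.get(word.lower(), syllables_heuristic(word))

def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

words = ["the", "idea", "of", "science", "is", "to", "create", "variation"]
heuristic_total = sum(syllables_heuristic(w) for w in words)  # 11
lookup_total = sum(syllables_lookup(w) for w in words)        # 15
gap = fkgl(len(words), 1, lookup_total) - fkgl(len(words), 1, heuristic_total)
```

Adjacent vowels that belong to separate syllables ("i-de-a", "var-i-a-tion") collapse into one vowel group, so the heuristic undercounts. The word list is stacked with such cases, which exaggerates the gap; on ordinary prose the same mechanism produces the much smaller drift described above.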
Complex-word definition. Gunning Fog's original specification excludes proper nouns, compound words, and words made polysyllabic only by common suffixes (-ed, -es, -ing). Almost no implementation actually applies the full exclusion list. Tools closer to the spec read lower; tools that count every 3+-syllable token read higher.
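The effect of the exclusion list can be sketched as follows. Both predicates are illustrative assumptions: the capital-letter test is only a crude proxy for proper nouns, and the syllable counter is a rough vowel-group rule.

```python
import re

def syllables(word):
    """Rough vowel-group count; an assumption, not a dictionary lookup."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def is_complex_permissive(word):
    """Count every token with three or more syllables."""
    return syllables(word) >= 3

def is_complex_strict(word):
    """Apply an approximation of Gunning's exclusions before counting."""
    if word[0].isupper():          # crude proxy for proper nouns
        return False
    if "-" in word:                # hyphenated compound
        return False
    for suffix in ("ing", "ed", "es"):
        if word.lower().endswith(suffix):
            stem = word[: -len(suffix)]
            if stem and syllables(stem) < 3:
                return False       # polysyllabic only because of the suffix
    return syllables(word) >= 3

def fog(words, sentences, complex_words):
    """Gunning Fog index from raw counts."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

tokens = ["Alexander", "was", "repeating", "well-documented",
          "experiments", "carefully"]
permissive = sum(is_complex_permissive(t) for t in tokens)  # 5
strict = sum(is_complex_strict(t) for t in tokens)          # 2
```

On this six-token sample the strict rule keeps only "experiments" and "carefully", so `fog(6, 1, strict)` comes out well below `fog(6, 1, permissive)`. A production version would need real proper-noun detection, since the capital-letter test also excludes every sentence-initial word.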
For MTLD specifically, the McCarthy and Jarvis (2010) specification averages a forward and a backward pass over the text. Some implementations run forward only, which moves the score by 5–15 points on the same text. Lemmatization choices add another few points of variance.
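A minimal sketch of the forward-only versus bidirectional difference, assuming the commonly used 0.72 TTR threshold and a partial-factor credit for the leftover segment. Whether a factor closes at `<=` or `<` the threshold is itself an implementation choice; `<=` is assumed here.

```python
def mtld_one_direction(tokens, threshold=0.72):
    """Mean length of token runs that keep type-token ratio above threshold."""
    factors = 0.0
    types, count = set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            factors += 1              # a full factor is complete
            types, count = set(), 0   # reset for the next segment
    if count > 0:
        # Partial credit for the unfinished segment at the end.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # Undefined when no repetition occurs at all; inf is an assumption here.
    return len(tokens) / factors if factors > 0 else float("inf")

def mtld_bidirectional(tokens, threshold=0.72):
    """Average of a forward pass and a pass over the reversed token list."""
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2

tokens = "a a a b c d e f g h".split()
forward_only = mtld_one_direction(tokens)     # about 10
backward_only = mtld_one_direction(tokens[::-1])  # about 14
both = mtld_bidirectional(tokens)             # about 12
```

The toy sequence puts all its repetition at one end, so the two passes segment it differently. Real text is less lopsided, but the same asymmetry is what separates forward-only implementations from the averaged specification.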
Characteristic margins
Inter-tool drift is large enough to notice but small enough to be uninformative about either tool's correctness. On the same prose passage:
- Flesch Reading Ease: 1–5 points typical, on a 0–100 scale.
- Flesch-Kincaid Grade: 0.5–1.5 grade levels typical.
- Gunning Fog: 0.5–2 points typical.
- MTLD: 5–15 points typical, ~3–10% of the value.
A delta inside these bands is noise. A delta outside them is worth investigating. A delta that flips direction across passages — Tool A reads passage X harder, Tool B reads passage Y harder — is the signal that something is genuinely broken.
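The triage rule above can be written down as a short check. This is a sketch: the function names are hypothetical, and the thresholds are simply the upper ends of the typical bands listed in this section.

```python
# Upper end of each typical inter-tool band from the list above.
TYPICAL_BANDS = {
    "flesch_reading_ease": 5.0,
    "flesch_kincaid_grade": 1.5,
    "gunning_fog": 2.0,
    "mtld": 15.0,
}

def classify_delta(metric, score_a, score_b):
    """Label a two-tool gap as noise or as worth investigating."""
    delta = abs(score_a - score_b)
    return "noise" if delta <= TYPICAL_BANDS[metric] else "investigate"

def direction_flips(signed_deltas):
    """True when per-passage deltas (tool A minus tool B) change sign."""
    signs = {d > 0 for d in signed_deltas if d != 0}
    return len(signs) > 1

print(classify_delta("flesch_kincaid_grade", 12.14, 11.19))  # noise
print(direction_flips([0.9, 1.1, 0.4]))                      # False
print(direction_flips([0.9, -1.2]))                          # True
```

A stricter variant might tolerate sign changes when the deltas are very close to zero, since tiny gaps can flip direction by chance.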
What this means for tool choice and validation
The right way to validate a new readability or lexical-diversity tool against an established one is rank correlation across a corpus, not number-matching on a single text. If a candidate tool ranks ten reading passages in the same order that Text Inspector does, the two are equivalent for a calibration job, regardless of whether their absolute FKGL numbers match. If the rankings disagree, the variance has crossed from calibration drift into genuine construct disagreement and the tool is not yet a substitute.
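Rank-based validation can be sketched with a small Spearman correlation. The scores below are hypothetical, and this version assumes no tied scores; with ties, a library routine that averages tied ranks would be the safer choice.

```python
def ranks(values):
    """Rank values from 1 (lowest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical FKGL scores for five passages.
reference = [6.2, 8.9, 11.4, 9.7, 13.1]
candidate = [5.5, 8.1, 10.2, 8.8, 12.3]   # lower everywhere, same order
swapped = [5.5, 10.2, 8.1, 8.8, 12.3]     # two passages change places

print(spearman(reference, candidate))  # 1.0
print(round(spearman(reference, swapped), 2))  # 0.6
```

The candidate tool never matches the reference on absolute value, yet its correlation is perfect because the ordering is identical. That is the property a calibration job depends on.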
The same logic applies to test-design work. A passage's readability score is meaningful as a comparison against other passages scored by the same tool. Citing FKGL 9.5 from one tool and FKGL 11.0 from another and treating the gap as evidence is a category error.
Document the trade-offs. Any tool that produces a calibrated metric should publish, alongside the number, the syllable counter, tokenizer, and definitional choices it makes. Without that disclosure, the user cannot tell calibration drift from genuine disagreement.
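One way such a disclosure could travel with the score is as a small structured record. The class and field names below are hypothetical, and the field values are placeholders describing an imagined heuristic tool, not a real one.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class MetricDisclosure:
    """A score plus the implementation choices that produced it."""
    metric: str
    value: float
    tokenizer: str
    sentence_splitter: str
    syllable_counter: str
    notes: str = ""

report = MetricDisclosure(
    metric="FKGL",
    value=11.19,
    tokenizer="regex, hyphenated compounds kept as one token",
    sentence_splitter="punctuation split with abbreviation list",
    syllable_counter="letter-pattern heuristic",
)
record = asdict(report)  # plain dict, ready to serialize next to the score
```

Keeping the choices in the same record as the number makes it harder for the score to be quoted without its provenance.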
Worked example: heuristic-syllable tool vs Text Inspector
A single passage scored by Text Inspector (CMU-based) and a Python implementation using a letter-pattern syllable counter:
| Metric | Text Inspector (CMU) | Heuristic-syllable | Δ | Within typical band? |
|---|---|---|---|---|
| Flesch Reading Ease | 55.74 | 56.55 | 0.81 | yes (below 1–5) |
| Flesch-Kincaid Grade | 12.14 | 11.19 | 0.95 | yes (0.5–1.5) |
| Gunning Fog | 15.03 | 14.15 | 0.88 | yes (0.5–2) |
| MTLD | 142.82 | 137.01 | 5.81 (~4%) | yes (3–10%) |
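All four deltas land inside or below the expected inter-tool variance. The three readability deltas also point the same direction: the heuristic-syllable tool reads marginally easier on Flesch Reading Ease, Flesch-Kincaid Grade, and Gunning Fog. That directional consistency is the diagnostic. It identifies the cause as a single conservative call (the heuristic syllable counter relative to a CMU lookup), not a broken implementation. MTLD does not use syllable counts, so its roughly 4% gap reflects tokenization, lemmatization, or pass-direction choices; it is still within the typical band. The tools agree on the kind of text this is; they disagree by the calibration distance their implementation choices introduce.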
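The table's arithmetic can be rechecked in a few lines. The scores are copied from the table above; the variable names are illustrative.

```python
# (Text Inspector score, heuristic-syllable score), copied from the table.
rows = {
    "Flesch Reading Ease": (55.74, 56.55),
    "Flesch-Kincaid Grade": (12.14, 11.19),
    "Gunning Fog": (15.03, 14.15),
    "MTLD": (142.82, 137.01),
}

deltas = {name: round(abs(a - b), 2) for name, (a, b) in rows.items()}
mtld_percent = 100 * deltas["MTLD"] / rows["MTLD"][0]

# Higher Reading Ease means easier; lower grade and Fog mean easier.
heuristic_reads_easier = (
    rows["Flesch Reading Ease"][1] > rows["Flesch Reading Ease"][0]
    and rows["Flesch-Kincaid Grade"][1] < rows["Flesch-Kincaid Grade"][0]
    and rows["Gunning Fog"][1] < rows["Gunning Fog"][0]
)

print(deltas)
print(round(mtld_percent, 1))   # 4.1
print(heuristic_reads_easier)   # True
```

Flesch Reading Ease runs in the opposite direction from the grade-level metrics, so the direction check has to flip its comparison; a naive "is the heuristic score lower?" test would wrongly report a disagreement.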
Key References
- McCarthy, P. M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
- Kincaid, J. P., Fishburne, R. P., Rogers, R. L. & Chissom, B. S. (1975). Derivation of new readability formulas for Navy enlisted personnel. NTTC Research Branch Report 8-75.
- Gunning, R. (1952). The Technique of Clear Writing. McGraw-Hill. (Original spec for the polysyllabic-word exclusion list.)
- DuBay, W. H. (2004). The Principles of Readability. Impact Information.
See Also
- Readability: the broader construct and what its formulas approximate
- Flesch-Kincaid Grade Level: the metric most affected by syllable-counter choice
- Text Inspector: CMU-syllable reference implementation in this comparison
- Coh-Metrix: discourse-level alternative when surface-feature variance matters too much