Surface and Deep Text Features
A working distinction in readability, lexical analysis, and automated text scoring. Surface features are properties of a text you can compute by counting alone (characters, syllables, words, sentences): no parsing, no meaning, no lookup against an external resource. Deep features require linguistic processing, external knowledge, or both.
The classic surface trio is sentence length (words ÷ sentences), word length (syllables or characters ÷ words), and word count itself. These are what every traditional readability formula (Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, SMOG, the Dale-Chall Readability Formula) runs on, with the partial exception of Dale-Chall, which adds a lookup against a list of familiar words and so already crosses into deep territory.
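As a rough illustration, the surface trio can be computed with nothing more than regular expressions. The sketch below is a minimal version under stated assumptions: the sentence splitter, the tokeniser, and the vowel-run syllable counter are all deliberately naive, and the function names are hypothetical rather than taken from any particular tool.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(text):
    # Words are runs of letters, optionally with internal apostrophes.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def count_syllables(word):
    # Crude heuristic (assumption): one syllable per run of vowels, minimum one.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def surface_features(text):
    sentences = split_sentences(text)
    words = tokenize(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return {
        "word_count": len(words),
        "words_per_sentence": len(words) / len(sentences),
        "syllables_per_word": syllables / len(words),
        "chars_per_word": sum(len(w) for w in words) / len(words),
    }

sample = "The cat sat on the mat. It was happy."
print(surface_features(sample))
```

Production syllabifiers typically rely on pronunciation dictionaries or hyphenation rules, so their counts will differ from this heuristic on many words.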
In a regression context, "predictor" is just the statistical term for an independent variable plugged in to estimate an outcome. Flesch (1948) fitted two surface predictors, sentence length and word length in syllables, to the grade-calibrated McCall-Crabbs passages; the formula's coefficients are the regression weights those two surface counts earned.
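To make the "regression weights" concrete, here is a minimal sketch of the Reading Ease formula in its commonly cited form, 206.835 − 1.015 × (words ÷ sentences) − 84.6 × (syllables ÷ words). The function name and the choice to pass in raw counts are assumptions made for illustration; how the counts are obtained is left to the caller.

```python
def flesch_reading_ease(words, sentences, syllables):
    # Commonly cited Reading Ease weights; higher scores indicate easier text.
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences   # surface predictor 1
    syllables_per_word = syllables / words   # surface predictor 2
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 3))
```

The intercept and the two negative weights are all the formula contains: longer sentences and longer words each pull the score down, and nothing else about the text enters the calculation.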
What the surface can't see
Surface features are blind to anything that requires interpretation. The words photosynthesis and personality both have five syllables, but one is a technical term most readers meet rarely and the other a common psychological abstraction; a surface formula treats them identically. A 12-word sentence with three nested relative clauses processes very differently from a 12-word coordinate sentence; surface formulas score them the same. An anaphor four sentences from its antecedent costs working memory; surface formulas don't notice anaphors at all.
The deep features that do see these things include vocabulary frequency (lookup against word-frequency lists or corpora), syntactic complexity (parse-tree depth, clause embedding, branching direction), referential cohesion (anaphor distance, lexical overlap between sentences), connective density, topical familiarity, and the broader schema demand a text places on its reader. Computing them needs a parser, a lemmatiser, a frequency list, sometimes a discourse model: substantially more machinery than a syllable counter.
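One of the listed features, lexical overlap between sentences, can be approximated without heavy machinery, which makes it a convenient illustration. The sketch below computes Jaccard overlap of content words for each adjacent sentence pair. It is a simplified proxy, not the definition used by Coh-Metrix or any other tool: the stopword list is a tiny hypothetical one, and no lemmatisation is done, so "plant" and "plants" would count as different words.

```python
import re

# Tiny illustrative stopword list (an assumption); real tools use fuller lists.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "it", "is", "was", "on", "this", "that"}

def content_words(sentence):
    # Lowercase letter runs with stopwords removed; no lemmatisation.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}

def adjacent_overlap(sentences):
    # Jaccard overlap of content-word sets for each adjacent sentence pair.
    scores = []
    for prev, curr in zip(sentences, sentences[1:]):
        a, b = content_words(prev), content_words(curr)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return scores

sentences = [
    "Plants absorb light.",
    "Light drives photosynthesis.",
    "The weather was cold.",
]
print(adjacent_overlap(sentences))  # [0.2, 0.0]
```

The first pair shares "light" and scores above zero; the second pair shares nothing, signalling a cohesion break that no sentence-length or syllable count would register.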
Why the distinction matters
Surface features are cheap, consistent across implementations (modulo tokeniser and syllabifier choices), and correlate well enough with text difficulty to be useful as a coarse first filter. They are also why every consumer readability tool produces a number quickly and why those numbers diverge from human judgement in predictable ways.
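The tokeniser caveat is easy to demonstrate. The sketch below applies two plausible but hypothetical tokenisation rules to the same sentence; neither is claimed to be what any specific tool does.

```python
import re

def whitespace_tokens(text):
    # Rule A: split on whitespace, keeping punctuation attached to words.
    return text.split()

def letter_run_tokens(text):
    # Rule B: every unbroken run of letters is a word.
    return re.findall(r"[A-Za-z]+", text)

text = "It's a state-of-the-art tool."
print(len(whitespace_tokens(text)))   # 4
print(len(letter_run_tokens(text)))   # 8
```

The same one-sentence text has a sentence length of 4 under one rule and 8 under the other, so any formula weighted on words per sentence will report different scores depending on which rule the tool adopted.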
Deep features sit closer to the construct of readability itself but require more infrastructure, are less stable across tools (different parsers, different cohesion definitions), and historically lived in research-only systems like Coh-Metrix. The current generation of transformer-based text-level classifiers short-circuits the surface/deep split by learning features end-to-end from labelled data. They implicitly capture both kinds without exposing them as named predictors, which is why their reasons for rating a text C1 on the CEFR scale are harder to audit than a Flesch score.
For ELT use, the practical heuristic is: surface formulas to triage and rank, deep features (or human judgement) to validate and decide. Reporting both kinds together is more informative than either alone, and substantially more honest about the text-difficulty construct than reporting a single Flesch number.
Key References
- Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233.
- Graesser, A. C., McNamara, D. S., Louwerse, M. M. & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2), 193–202.
- Crossley, S. A., Greenfield, J. & McNamara, D. S. (2008). Assessing text readability using cognitively based indices. TESOL Quarterly, 42(3), 475–493.
See Also
- Readability: the broader construct surface formulas approximate
- Flesch Reading Ease and Flesch-Kincaid Grade Level: the canonical surface formulas
- Coh-Metrix: the canonical deep-feature toolkit
- Text Metric Implementation Variance: why even surface features drift across tools
- CEFR Text-Level Classification: how learned features collapse the distinction