Linear Regression
A statistical procedure for estimating one variable from one or more others using a straight-line equation. The dominant data-analytic tool in mid-twentieth-century educational measurement, and the engine behind every classical readability formula.
The core idea is to fit a line of the form $y = a + bx$ (with one $b$ per predictor when there are several) that minimises the sum of squared distances between the line's predictions and the observed values. The squared-distance criterion is what ordinary least squares (OLS) means; "fit" is the colloquial label for "what the procedure does".
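As a worked statement of the criterion, writing the line as $y = a + bx$ for a single predictor (the symbols $a$, $b$, $x_i$, $y_i$ and the bars for sample means are notation assumed here, not taken from any cited source), the least-squares problem and its standard closed-form solution can be sketched as:

```latex
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
\qquad\Longrightarrow\qquad
b = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2},
\quad
a = \bar{y} - b\,\bar{x}
```

With several predictors the same criterion applies, but the solution is usually written in matrix form and computed by software.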
The vocabulary
Dependent variable ($y$): the thing being predicted or explained. The outcome. In Flesch (1948) the dependent variable is reading difficulty as scored against the McCall-Crabbs grade-calibrated passages. In an item-analysis regression of test scores on item responses, the outcome is total test score.
Independent variable / predictor ($x$): a thing plugged in to estimate the outcome. Predictor is the more common term in educational and psychological measurement; independent variable is the older statistical name. A regression with one predictor is simple; with several, multiple.
Coefficient ($b$): the weight a predictor earns in the fitted equation. Numerically it is the change in the outcome associated with a one-unit change in that predictor, holding the others fixed. Flesch's coefficient of $-1.015$ on average sentence length means that adding one word per sentence reduces the FRE score by about one point.
Intercept ($a$): the predicted value of the outcome when every predictor is zero. Often interpretively meaningless on its own (a text with zero sentences and zero syllables has no defined difficulty) but mathematically essential as the line's anchor point. Flesch's intercept of $206.835$ has no human interpretation; it exists so the rest of the formula lands in the right numerical neighbourhood.
Regressing $y$ on $x$: the verb form. "We regressed reading difficulty on sentence length and word length" means we ran a regression with those two predictors and reading difficulty as the outcome.
Fit / goodness of fit / $R^2$: how well the equation explains the data. $R^2$ is the proportion of variance in the outcome accounted for by the predictors, scaled 0 to 1. Flesch reported $R^2 \approx 0.70$ for his two-predictor model, meaning about 70% of the variance in McCall-Crabbs difficulty was captured by sentence length and syllable rate alone: high enough to publish, low enough to leave 30% of the construct unaccounted for. That residual is where every limitation of surface formulas lives.
Residual: the observed value minus the line's prediction for it. With difficulty as the outcome, a text that is harder than the fitted equation predicts has a positive residual.
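The vocabulary above can be tied together in a small sketch. It fits a one-predictor OLS line by hand and reads off the intercept, coefficient, residuals and proportion of variance explained. The five data points and the names `sentence_length`, `difficulty` and `fit_simple_ols` are invented for illustration and are not Flesch's data; only the constants in `flesch_reading_ease` are the published Reading Ease values.

```python
# Hypothetical data: average sentence length (words) and a difficulty grade
# for five passages. These numbers are invented for illustration only.
sentence_length = [10.0, 15.0, 20.0, 25.0, 30.0]   # predictor x
difficulty = [2.0, 4.0, 5.0, 4.0, 5.0]             # outcome y


def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_ols(sentence_length, difficulty)

# Residual = observed value minus the line's prediction.
predicted = [intercept + slope * xi for xi in sentence_length]
residuals = [yi - pred for yi, pred in zip(difficulty, predicted)]

# R squared = 1 - (unexplained variation / total variation).
mean_y = sum(difficulty) / len(difficulty)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in difficulty)
r_squared = 1 - ss_res / ss_tot


def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Flesch Reading Ease using its published intercept and coefficients."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# One extra word per sentence, with the syllable rate held fixed.
fre_change = flesch_reading_ease(21, 1.5) - flesch_reading_ease(20, 1.5)

print(f"intercept a = {intercept:.2f}, coefficient b = {slope:.2f}")
print(f"R^2 = {r_squared:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
print(f"FRE change for one extra word per sentence: {fre_change:.3f}")
```

On this toy data the residuals sum to zero, as they always do for an OLS fit with an intercept, and the change in FRE for one extra word per sentence equals the sentence-length coefficient.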
Where it shows up in ELT
Most quantitative ELT and SLA research stands on a regression somewhere. Beyond the readability formulas:
MTLD and vocd-D derivation. Both lexical-diversity indices fit curves to data, with vocd-D using a non-linear regression of TTR against sample size to estimate the parameter $D$. The published sentences "vocd-D regresses TTR against tokens" and "MTLD's stability factor of 0.72 is a regression-tuned threshold" both refer to this fit step.
Item analysis. Classical-test-theory item discrimination is, in one common formulation, the point-biserial correlation between item response and total score; moving to a multiple regression of total score on item responses gives each item's incremental contribution. Haladyna et al. (2002) tabulate empirical findings that all came from regressions of this kind.
Meta-analysis. Effect-size meta-analyses regress study-level outcomes on study-level moderators (treatment intensity, learner age, instructional setting) to identify what predicts treatment effectiveness. Reading Bryfonski & McKay (2019) on TBLT or Norris & Ortega (2000) on L2 instruction means reading regression output, even when the prose presents it as plain percentages.
CEFR-aligned classifiers. The pre-transformer generation of CEFR text-level classifiers used logistic or ordinal regression of CEFR level on lexical and syntactic features. The current transformer generation has moved beyond regression but still reports performance as classification accuracy against the same regression-era benchmarks.
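To make the item-analysis case concrete, here is a minimal sketch of the point-biserial correlation. It computes the value twice: once as an ordinary Pearson correlation between a 0/1 item and total score, and once with the textbook point-biserial formula, to show that the two agree. The six test takers, their scores, and the function names are hypothetical.

```python
import math

# Hypothetical responses from six test takers: 1 = item answered correctly.
item = [1, 1, 0, 1, 0, 0]
total = [9, 8, 5, 7, 4, 3]   # invented total test scores


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def point_biserial(binary, scores):
    """Point-biserial: (M1 - M0) / SD * sqrt(p * q), using the population SD."""
    n = len(scores)
    passed = [s for b, s in zip(binary, scores) if b == 1]
    failed = [s for b, s in zip(binary, scores) if b == 0]
    p = len(passed) / n
    mean_all = sum(scores) / n
    sd_all = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    mean_diff = sum(passed) / len(passed) - sum(failed) / len(failed)
    return mean_diff / sd_all * math.sqrt(p * (1 - p))


r_item_total = pearson_r(item, total)
r_pb = point_biserial(item, total)

print(f"Pearson r (item, total): {r_item_total:.4f}")
print(f"Point-biserial formula:  {r_pb:.4f}")
```

In this sketch the total score includes the item itself; operational item analyses often correct for that, which is left out here for brevity.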
What regression does and does not give you
Regression gives you the best-fitting line under the squared-error criterion, with associated standard errors and significance tests for each coefficient. It does not give you causation, even when the predictor is plausibly causal. It also does not give you good predictions outside the range of the training data — a readability formula calibrated on grade-school passages does not transfer cleanly to academic prose, which is one reason FKGL produces strange numbers on highly nominalised text.
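The extrapolation problem can be illustrated with a toy sketch. A line is fitted to data whose true relationship is curved, then used both inside and far outside the range it was calibrated on. The data, the quadratic "truth" and the names below are invented for illustration; they are not readability data.

```python
# Toy data: the true relationship is curved (y = x squared), but only
# x values from 1 to 4 are observed. Invented numbers for illustration.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [xi ** 2 for xi in x_train]


def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_simple_ols(x_train, y_train)


def predict(x_new):
    """Prediction from the fitted straight line."""
    return intercept + slope * x_new


inside = predict(3.0)     # within the calibration range; true value is 9
outside = predict(10.0)   # far outside it; true value is 100

print(f"x = 3:  predicted {inside:.1f}, true {3.0 ** 2:.1f}")
print(f"x = 10: predicted {outside:.1f}, true {10.0 ** 2:.1f}")
```

Inside the observed range the line is off by one unit; at x = 10 it misses by more than half the true value, even though nothing about the fit itself signals a problem.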
Regression assumes that the relationship is linear in the parameters, that residuals are normally distributed, that predictors are not strongly correlated with each other (no multicollinearity), and that the error variance is constant across the range of the predictors (homoscedasticity). These assumptions are routinely violated in published ELT work and routinely ignored in interpretation. Reading regression-based claims with the assumptions in mind is most of what reading them critically amounts to.
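One of these checks is easy to sketch. For a model with exactly two predictors, the variance inflation factor (VIF) of each predictor is 1 / (1 - r²), where r is the correlation between the two. The sentence-length and syllable-rate values below are hypothetical, chosen to be strongly correlated, and the threshold mentioned in the comment is a common rule of thumb rather than a fixed standard.

```python
import math

# Hypothetical predictor values for five texts (invented for illustration).
words_per_sentence = [10.0, 12.0, 15.0, 18.0, 20.0]
syllables_per_word = [1.2, 1.3, 1.4, 1.6, 1.7]


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(words_per_sentence, syllables_per_word)

# With only two predictors, each predictor's VIF is 1 / (1 - r^2).
# Values above roughly 10 are often treated as a warning sign
# (a rule of thumb, not a formal test).
vif = 1.0 / (1.0 - r ** 2)

print(f"correlation between predictors: {r:.3f}")
print(f"VIF: {vif:.1f}")
```

A high VIF does not bias the fitted coefficients, but it inflates their standard errors, so the individual weights become hard to interpret even when the overall fit looks good. With more than two predictors the VIF for each one comes from regressing it on all the others, which is better left to a statistics package.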
Key References
- Cohen, J., Cohen, P., West, S. G. & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Lawrence Erlbaum.
- Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). Sage. Chapters 8–10.
- Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233. (Worked example of a two-predictor OLS fit.)
See Also
- Surface and Deep Text Features: what gets used as the predictors in readability research
- Effect Size: regression's natural companion in meta-analysis
- Item Analysis: regression in classical test theory
- Flesch Reading Ease / Flesch-Kincaid Grade Level: textbook two-predictor OLS fits