Effect Size
An effect size quantifies the magnitude of a difference between groups or the strength of a relationship. In SLA research, it answers how much the treatment helped, not merely whether the observed difference is unlikely to be due to chance (the question a p-value addresses).
Cohen's d
The most common effect size in SLA intervention research. Calculated as the difference between two group means divided by the pooled standard deviation.
Conventional benchmarks (Cohen, 1988):
- d = 0.2 — small
- d = 0.5 — medium
- d = 0.8 — large
These are rules of thumb, not absolute standards. A "small" effect in a high-stakes context (e.g., d = 0.3 for a medical treatment) may be highly meaningful.
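The definition above (mean difference over the pooled standard deviation) can be sketched in a few lines. The scores below are hypothetical post-test values invented for illustration, not data from any study:

```python
import statistics

def cohens_d(group1, group2):
    """Cohen's d: difference in group means divided by the pooled SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    # statistics.variance uses the n-1 (sample) denominator
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

# Hypothetical post-test scores for a treatment and a control group
treatment = [78, 82, 85, 90, 88, 76, 84]
control   = [70, 75, 72, 80, 77, 68, 74]
print(round(cohens_d(treatment, control), 2))  # → 2.08
```

With real SLA data, values this large are rare; the point is only the mechanics of the calculation.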
Hedges' g
A corrected version of Cohen's d that adjusts for small sample sizes. Preferred when studies have fewer than ~20 participants per group. In practice, d and g are very similar for larger samples.
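The correction is a multiplicative factor that shrinks d slightly toward zero; it matters for small groups and becomes negligible as samples grow. A minimal sketch using Hedges' standard approximation of the correction factor:

```python
def hedges_g(d, n1, n2):
    """Apply the small-sample correction factor J to Cohen's d."""
    # Hedges' approximation: J = 1 - 3 / (4*df - 1), with df = n1 + n2 - 2
    df = n1 + n2 - 2
    correction = 1 - 3 / (4 * df - 1)
    return d * correction

# With 10 participants per group, the correction is noticeable...
print(round(hedges_g(0.80, 10, 10), 3))    # → 0.766
# ...with 100 per group, d and g are nearly identical
print(round(hedges_g(0.80, 100, 100), 3))  # → 0.797
```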
Why Effect Size Matters More Than p-Values
A study can find a "statistically significant" result (p < .05) with a trivially small effect size if the sample is large enough. Conversely, a meaningful effect can fail to reach significance in a small study. Effect size separates the size of the finding from the confidence we have in it.
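This trade-off can be made concrete with the relationship between d and the t statistic for two equal-sized groups, t = d·√(n/2), where n is the per-group sample size (a standard identity for the independent-samples t-test; the sample sizes below are invented for illustration):

```python
import math

def t_from_d(d, n_per_group):
    """t statistic implied by Cohen's d for two equal-sized groups."""
    return d * math.sqrt(n_per_group / 2)

# Trivial effect, huge sample: exceeds the ~1.96 significance threshold
print(round(t_from_d(0.05, 10_000), 2))  # → 3.54
# Medium effect, small sample: fails to reach significance
print(round(t_from_d(0.50, 15), 2))      # → 1.37
```

The first result would be reported as "significant" despite being practically negligible; the second is a textbook medium effect that a small study cannot detect.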
In the TBLT Meta-Analysis Debate
| Source | Effect Size | Interpretation |
|---|---|---|
| Bryfonski & McKay (2019) | d = 0.93 | Large — but from methodologically flawed studies |
| Xuan et al. (2022) recalculation | g = 0.61 | Medium — from better-screened subset |
| Norris & Ortega (2000) | d = 0.96 | Large — but outcome measures biased toward explicit knowledge |
The controversy illustrates that a large effect size is only as trustworthy as the studies producing it. Methodological flaws (practice-test congruency, missing pre-tests, design conflation) can all inflate effect sizes.