Effect Size
An effect size quantifies the magnitude of a difference between groups or the strength of a relationship. In SLA research, it answers how much a treatment helped, not merely whether the result is distinguishable from chance (the question a p-value addresses).
Cohen's d
The most common effect size in SLA intervention research. Calculated as the difference between two group means divided by the pooled standard deviation.
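As a rough illustration of the calculation, here is a minimal Python sketch. It assumes the simple (unweighted) pooling of the two sample variances, which is one common convention; the text does not say which pooling formula it intends. The function name `cohens_d` and the scores are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(treatment, control):
    """Mean difference divided by a pooled standard deviation.

    Assumption: the pooled SD is the root of the average of the two
    sample variances (unweighted pooling).
    """
    v1 = variance(treatment)  # sample variance, n - 1 denominator
    v2 = variance(control)
    pooled_sd = math.sqrt((v1 + v2) / 2)
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical post-test scores, for illustration only
treatment_scores = [14, 16, 18, 20, 22]
control_scores = [12, 14, 16, 18, 20]

d = cohens_d(treatment_scores, control_scores)
print(round(d, 2))  # roughly 0.63
```

Here the means differ by 2 points and both groups have a standard deviation of about 3.16, so the treatment group sits roughly 0.63 standard deviations above the control group.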
Conventional benchmarks (Cohen, 1988):
- d = 0.2: small
- d = 0.5: medium
- d = 0.8: large
These are rules of thumb, not absolute standards. A "small" effect in a high-stakes context (e.g., d = 0.3 for a medical treatment) may be highly meaningful.
Hedges' g
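A corrected version of Cohen's d whose pooled standard deviation accounts for unequal group sizes: the pooled SD formula weights each group's variance by its degrees of freedom (n − 1). Preferred when group standard deviations differ (S₁ ≠ S₂) and the groups are of unequal size, because that is when the weighting changes the pooled value. When the groups are the same size, or their standard deviations are equal, the weighted and unweighted pooled SDs coincide and d and g are identical.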
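The effect of the weighting can be sketched in a few lines of Python. This assumes that the unweighted alternative averages the two variances directly; the function names and summary statistics are hypothetical.

```python
import math

def pooled_sd_unweighted(s1, s2):
    """Root of the simple average of the two variances."""
    return math.sqrt((s1 ** 2 + s2 ** 2) / 2)

def pooled_sd_weighted(n1, s1, n2, s2):
    """Each variance weighted by its degrees of freedom (n - 1)."""
    numerator = (n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2
    return math.sqrt(numerator / (n1 + n2 - 2))

# Hypothetical summary statistics: unequal group sizes and unequal SDs
n1, m1, s1 = 40, 75.0, 8.0
n2, m2, s2 = 10, 70.0, 12.0

d_unweighted = (m1 - m2) / pooled_sd_unweighted(s1, s2)
g_weighted = (m1 - m2) / pooled_sd_weighted(n1, s1, n2, s2)

print(round(d_unweighted, 3), round(g_weighted, 3))
```

With these numbers the larger group has the smaller SD, so the weighted pooled SD is smaller and the weighted estimate is larger (about 0.56 versus 0.49). With equal group sizes the two functions return the same value.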
Hedges' adjusted g
A small-sample correction applied to Hedges' g. When the total sample (n₁ + n₂) is small, especially below 20, both Cohen's d and Hedges' g tend to overestimate the true effect. Hedges' adjusted g multiplies g by a correction factor k that is slightly below 1 and approaches 1 as sample size grows, so estimates from small studies are shrunk the most. In a meta-analysis, this reduces the pull of small studies on the average effect size across studies, in effect giving more weight to larger ones.
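The text does not give a formula for k. A commonly used approximation is k ≈ 1 − 3 / (4(n₁ + n₂) − 9), and the sketch below assumes that form; the function name is hypothetical.

```python
def small_sample_correction(n1, n2):
    """Approximate correction factor k for Hedges' adjusted g.

    Assumption: the common approximation k = 1 - 3 / (4 * df - 1),
    with df = n1 + n2 - 2, which equals 1 - 3 / (4 * (n1 + n2) - 9).
    """
    df = n1 + n2 - 2
    return 1 - 3 / (4 * df - 1)

# k for equal-sized groups of increasing size
for n in (5, 10, 50, 500):
    k = small_sample_correction(n, n)
    print(f"n per group = {n:3d}  k = {k:.4f}")

# Applying the correction to a hypothetical g of 0.80 from two groups of 5
adjusted_g = 0.80 * small_sample_correction(5, 5)
```

Under this approximation k is about 0.90 for two groups of 5 and above 0.99 for two groups of 50, so the adjustment only matters in practice for small studies.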
Shin (2010) demonstrated that Norris & Ortega (2000) used unweighted Cohen's d, giving equal weight to all 49 data samples regardless of sample size. Since 88.5% had N ≤ 30, the resulting average effect size (d = 0.96) was likely inflated by sampling error from small studies. Hedges' adjusted g would have reduced this estimate by down-weighting the small-sample studies that showed the most extreme effect sizes.
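To see why equal weighting can inflate an average, consider the sketch below. The five studies are invented for illustration and are not data from Norris & Ortega (2000) or Shin (2010). The weighting shown is inverse-variance weighting with a standard large-sample approximation for the variance of a standardized mean difference, which is one common choice and an assumption here.

```python
def approx_variance(g, n1, n2):
    """Large-sample approximation to the sampling variance of g."""
    return (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2))

# Hypothetical studies: (n1, n2, effect size). Small studies show large effects.
studies = [
    (8, 8, 1.60),
    (10, 10, 1.30),
    (12, 12, 1.10),
    (60, 60, 0.40),
    (100, 100, 0.35),
]

# Unweighted mean: every study counts the same
unweighted_mean = sum(g for _, _, g in studies) / len(studies)

# Inverse-variance weighted mean: precise (large) studies count more
weights = [1 / approx_variance(g, n1, n2) for n1, n2, g in studies]
weighted_mean = sum(w * g for w, (_, _, g) in zip(weights, studies)) / sum(weights)

print(round(unweighted_mean, 2), round(weighted_mean, 2))
```

In this made-up set the unweighted mean is 0.95, while the weighted mean is about half that, because the two large studies dominate once precision is taken into account.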
| Statistic | Handles S₁ ≠ S₂ | Handles small N |
|---|---|---|
| Unstandardized ES | No | No |
| Cohen's d | No | No |
| Hedges' g | Yes | No |
| Hedges' adjusted g | Yes | Yes |
Why Effect Size Matters More Than p-Values
A study can find a "statistically significant" result (p < .05) with a trivially small effect size if the sample is large enough. Conversely, a meaningful effect can fail to reach significance in a small study. Effect size separates the size of the finding from the confidence we have in it.
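A quick numerical sketch of both cases, using only the Python standard library. It assumes two equal-sized groups and a normal approximation to the test statistic (z = d · √(n/2)), which is rough for small samples; the function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist

def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference.

    Assumption: equal group sizes and a normal approximation, so the
    test statistic is z = d * sqrt(n_per_group / 2).
    """
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Trivial effect, very large sample
p_trivial_effect = approx_two_sided_p(d=0.05, n_per_group=10_000)

# Sizeable effect, small sample
p_sizeable_effect = approx_two_sided_p(d=0.60, n_per_group=10)

print(p_trivial_effect < 0.05, p_sizeable_effect < 0.05)
```

The trivial effect (d = 0.05) comes out significant at the .05 level with 10,000 participants per group, while the much larger effect (d = 0.60) does not with 10 per group.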
In the TBLT Meta-Analysis Debate
| Source | Effect Size | Interpretation |
|---|---|---|
| Bryfonski & McKay (2019) | d = 0.93 | Large, but from methodologically flawed studies |
| Xuan et al. (2022) recalculation | g = 0.61 | Medium, from better-screened subset |
| Norris & Ortega (2000) | d = 0.96 | Large, but outcome measures biased toward explicit knowledge |
The controversy illustrates that a large effect size is only as trustworthy as the studies producing it. Methodological flaws (practice-test congruency, missing pre-tests, design conflation) can all inflate effect sizes.