Meta-Analysis in SLA
A meta-analysis statistically combines results from multiple studies on the same question to produce an aggregate effect size. In SLA, meta-analyses are used to estimate the effectiveness of instructional approaches (e.g., corrective feedback, explicit instruction, TBLT).
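The aggregation step can be sketched concretely. The following is a minimal illustration of a fixed-effect model, which weights each study's effect size by the inverse of its sampling variance; the effect sizes and variances here are made up for demonstration and are not taken from any of the studies discussed below.

```python
import math

def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted mean of study effect sizes (fixed-effect model)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled estimate
    return pooled, se

# Hypothetical Cohen's d values and sampling variances from five studies
effects = [0.4, 0.9, 0.2, 1.1, 0.6]
variances = [0.05, 0.10, 0.04, 0.20, 0.08]
pooled, se = fixed_effect_pool(effects, variances)
```

Note that the precision of the pooled estimate depends entirely on the variances fed in: the formula has no way of knowing whether those numbers come from sound or flawed primary studies.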
The Appeal
- Larger combined sample sizes than any single study
- Can reveal patterns invisible in individual studies
- Systematic, replicable procedure
- Carries high authority in academic and policy debates
The Pitfalls
The [[TBLT Meta-Analysis Controversy|Bryfonski-McKay controversy]] and critiques of Norris & Ortega (2000) illustrate recurring problems:

1. Garbage in, garbage out
If primary studies have methodological flaws (no pre-test, vague control groups, practice-test congruency), aggregating their effect sizes produces a precise-looking but misleading number. Statistical sophistication in the meta-analytic procedure cannot compensate for weak primary research.
2. Construct delineation
The variable of interest must be clearly defined. "TBLT" can mean genuine task-based teaching or [[Task-Supported Language Teaching|task-supported instruction]] bolted onto a grammar syllabus. "Explicit instruction" can mean PPP, structured input, corrective feedback, or metalinguistic explanation. Lumping dissimilar treatments under one label obscures more than it reveals.
3. Design conflation
Mixing effect sizes from between-group and within-group designs inflates the aggregate. Within-group (pre-post) designs measure all change, including maturation and practice effects, not just treatment effects.
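A toy calculation makes the inflation visible. The numbers below are hypothetical, chosen only to show the arithmetic: pre-post effect sizes run larger because they bundle practice and maturation gains with the treatment effect, so a simple pooled mean over both design types overstates the treatment's effectiveness.

```python
# Hypothetical effect sizes (Cohen's d), not drawn from any real study.
between_group = [0.35, 0.50, 0.40]  # treatment vs. control: isolates the treatment effect
within_group = [0.90, 1.10, 0.95]   # pre-post change: treatment + practice + maturation

clean_mean = sum(between_group) / len(between_group)
mixed = between_group + within_group
mixed_mean = sum(mixed) / len(mixed)
# mixed_mean exceeds clean_mean not because the treatment worked better,
# but because incomparable designs were pooled into one average.
```

Separating the two design types (or converting them onto a common metric before pooling) avoids attributing practice and maturation gains to the treatment.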
4. Outcome measure bias
Norris & Ortega (2000) has been critiqued (Shin, 2010) because most of the included studies used outcome measures biased toward explicit knowledge (grammar tests, metalinguistic judgments). If you test explicit knowledge, explicit instruction wins — but that tells you little about communicative ability.
5. Premature synthesis
When the primary research base is too small or too heterogeneous, meta-analysis creates a false impression of robust evidence. Boers & Faez (2023) concluded that the TBLT field is simply "not ripe for such a meta-analysis."
Major SLA Meta-Analyses and Their Critiques
| Meta-Analysis | Claim | Critique |
|---|---|---|
| Norris & Ortega (2000) | Explicit instruction more effective than implicit (d = 0.96) | Outcome measures biased toward explicit knowledge; oversimplified coding (Shin, 2010) |
| Li (2010) | Medium overall effect for CF | Conflated within-group and between-group designs (flagged by Boers et al., 2021) |
| Bryfonski & McKay (2019) | Large effect for TBLT (d = 0.93) | 51 of 52 studies failed rigorous screening; task-based/task-supported conflation |
| Xuan et al. (2022) | Recalculated TBLT effect: g = 0.61 | Better screening but still includes task-supported studies |
Reading Meta-Analyses Critically
Questions to ask:
- How is the construct defined? Would all researchers agree these studies measure the same thing?
- What designs were included? Are between-group and within-group studies pooled?
- What were the inclusion criteria? How many studies were screened vs. included?
- What do the outcome measures test? Explicit knowledge? Implicit knowledge? Communicative ability?
- Is the primary research base sufficient? Enough studies of adequate quality to warrant synthesis?