Meta-Analysis in SLA

A meta-analysis statistically combines results from multiple studies on the same question to produce an aggregate effect size. In second language acquisition (SLA) research, meta-analyses are used to estimate the effectiveness of instructional approaches (e.g., corrective feedback, explicit instruction, task-based language teaching [TBLT]).
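As a rough illustration of what "statistically combines" means, the sketch below pools effect sizes with inverse-variance weights under a fixed-effect model. This is only one common pooling scheme, not necessarily the one used in any meta-analysis discussed here, and the three effect sizes and variances are hypothetical.

```python
import math

def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]   # more precise studies weigh more
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)              # shrinks as studies are added
    return estimate, se

# Hypothetical effect sizes (d) and sampling variances for three studies.
effects = [0.2, 0.5, 0.8]
variances = [0.04, 0.04, 0.04]

estimate, se = pooled_effect(effects, variances)
print(f"pooled d = {estimate:.2f}, SE = {se:.3f}")
```

Because the pooled standard error is smaller than that of any single study, the aggregate looks more precise than its inputs. That is the source of both the appeal and the risk.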

The Appeal

Larger combined sample sizes than any single study
Can reveal patterns invisible in individual studies
Systematic, replicable procedure
Carries high authority in academic and policy debates

The Pitfalls

The Bryfonski-McKay controversy and critiques of Norris & Ortega (2000) illustrate recurring problems:

1. Garbage in, garbage out

If primary studies have methodological flaws (no pre-test, vague control groups, practice-test congruency), aggregating their effect sizes produces a precise-looking but misleading number. Statistical sophistication in the meta-analytic procedure cannot compensate for weak primary research.
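The "precise-looking but misleading" point can be sketched numerically. The example below assumes, as a deliberate simplification, that every primary study carries the same upward bias and the same sampling variance. The function name `pooled_interval` and all numbers are hypothetical.

```python
import math

def pooled_interval(true_effect, bias, variance, k):
    """95% interval for a fixed-effect pool of k equally precise studies
    that all share the same systematic bias (a deliberate simplification)."""
    estimate = true_effect + bias        # shared bias does not average out
    se = math.sqrt(variance / k)         # precision still improves with k
    return estimate, estimate - 1.96 * se, estimate + 1.96 * se

# Hypothetical values: true effect 0.3, design flaws add 0.4 to every study.
for k in (5, 20, 80):
    est, lo, hi = pooled_interval(0.3, 0.4, 0.1, k)
    print(f"k={k:2d}  estimate={est:.2f}  95% CI=[{lo:.2f}, {hi:.2f}]")
```

Under these assumptions, adding studies narrows the interval around the biased value rather than moving it toward the true effect.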

2. Construct delineation

The variable of interest must be clearly defined. "TBLT" can mean genuine task-based teaching or task-supported instruction bolted onto a grammar syllabus. "Explicit instruction" can mean presentation-practice-production (PPP), structured input, corrective feedback, or metalinguistic explanation. Lumping dissimilar treatments under one label obscures more than it reveals.

3. Design conflation

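Mixing effect sizes from between-group and within-group designs inflates the aggregate. Within-group (pre-post) designs capture all change between pre-test and post-test, including maturation and practice effects, not just treatment effects. Their effect sizes therefore tend to be larger than those from between-group comparisons, which measure the treatment group against a control group exposed to the same maturation and practice.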
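A small worked example makes the inflation concrete. The scores are hypothetical, and a single common standard deviation is assumed for simplicity. Real within-group effect sizes are often standardized differently, for example by the pre-test SD or the SD of change scores.

```python
def cohens_d(mean_a, mean_b, sd):
    """Standardized mean difference using a given standard deviation."""
    return (mean_a - mean_b) / sd

# Hypothetical test scores: both groups start at 50 and share an SD of 8.
pre_treat, post_treat = 50.0, 60.0   # treatment group gains 10 points
pre_ctrl, post_ctrl = 50.0, 56.0     # control group gains 6 points anyway
sd = 8.0

within_d = cohens_d(post_treat, pre_treat, sd)    # 1.25: all pre-post change
between_d = cohens_d(post_treat, post_ctrl, sd)   # 0.5: change beyond control

print(within_d, between_d)
```

In this sketch the same treatment yields d = 1.25 or d = 0.5 depending on the design, so pooling the two kinds of estimate pulls the aggregate upward.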

4. Outcome measure bias

Norris & Ortega (2000) has been critiqued (Shin, 2010) because most included studies used outcome measures biased toward explicit knowledge (grammar tests, metalinguistic judgments). If the tests measure explicit knowledge, explicit instruction will come out ahead, but that result says little about communicative ability.

5. Premature synthesis

When the primary research base is too small or too heterogeneous, meta-analysis creates a false impression of robust evidence. Boers & Faez (2023) concluded that the TBLT field is simply "not ripe for such a meta-analysis."
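One common way to put a number on "too heterogeneous" is Cochran's Q together with the I² statistic. The sketch below is a generic illustration with hypothetical effect sizes. It is not the procedure Boers & Faez (2023) used to reach their conclusion.

```python
def heterogeneity(effects, variances):
    """Cochran's Q and I² (in percent) for a set of study effect sizes."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I² is the share of total variation beyond what sampling error predicts.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Hypothetical effect sizes spread widely around their mean.
effects = [0.1, 0.5, 0.9, 1.3]
variances = [0.04, 0.04, 0.04, 0.04]

q, i2 = heterogeneity(effects, variances)
print(f"Q = {q:.1f}, I² = {i2:.0f}%")
```

With these made-up inputs, Q is about 20 on 3 degrees of freedom and I² is about 85%. In that situation a single pooled effect describes the studies poorly, however narrow its confidence interval.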

Major SLA Meta-Analyses and Their Critiques

Meta-Analysis	Claim	Critique
Norris & Ortega (2000)	Explicit instruction more effective than implicit (d = 0.96)	Outcome measures biased toward explicit knowledge; oversimplified coding (Shin, 2010)
Li (2010)	Medium overall effect for corrective feedback (CF)	Conflated within-group and between-group designs (flagged by Boers et al., 2021)
Bryfonski & McKay (2019)	Large effect for TBLT (d = 0.93)	51 of 52 studies failed rigorous screening; task-based/task-supported conflation
Xuan et al. (2022)	Recalculated TBLT effect: g = 0.61	Better screening but still includes task-supported studies
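The table reports Cohen's d for some syntheses and Hedges' g for another. The two are closely related: g applies a small-sample correction factor to d. The sketch below uses the standard approximation J = 1 - 3/(4·df - 1). The group sizes are hypothetical and are not taken from any of the studies above.

```python
def hedges_g(d, n1, n2):
    """Convert Cohen's d to Hedges' g with the usual small-sample correction."""
    df = n1 + n2 - 2
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)   # approaches 1 as df grows
    return d * correction

# Hypothetical: d = 0.93 from a study with 15 learners per group.
g = hedges_g(0.93, 15, 15)
print(f"g = {g:.3f}")
```

With these assumed group sizes the correction shrinks d by under 3%. A correction of that size would not by itself account for a drop from 0.93 to 0.61, which is consistent with the table attributing the lower figure to stricter screening.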

Reading Meta-Analyses Critically

Questions to ask:

How is the construct defined? Would all researchers agree these studies measure the same thing?
What designs were included? Are between-group and within-group studies pooled?
What were the inclusion criteria? How many studies were screened vs. included?
What do the outcome measures test? Explicit knowledge? Implicit knowledge? Communicative ability?
Is the primary research base sufficient? Enough studies of adequate quality to warrant synthesis?