Meta-Analysis in SLA
A meta-analysis statistically combines results from multiple studies on the same question to produce an aggregate effect size. Unlike a narrative literature review (which summarises qualitatively), a meta-analysis converts each study's findings into a common metric, then computes a weighted average, producing a single estimate with greater statistical power than any individual study. The term was coined by Gene Glass in 1976. In SLA, the method became prominent after Norris & Ortega (2000).
Key Statistical Concepts
Effect Size: the standardised magnitude of a treatment effect. The two main types in SLA are Cohen's d and Hedges' g (which corrects for small-sample bias). Plonsky & Oswald (2014) argued that Cohen's general benchmarks (0.2 / 0.5 / 0.8) underestimate typical SLA effects and proposed field-specific benchmarks:
| Design | Small | Medium | Large |
|---|---|---|---|
| Between-group | 0.40 | 0.70 | 1.00 |
| Within-group (pre-post) | 0.60 | 1.00 | 1.40 |
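
As a rough illustration of how these statistics relate, the sketch below computes Cohen's d from group summary statistics, applies the common approximate small-sample correction to obtain Hedges' g, and labels the result against the benchmarks in the table above. The function names and the study values are hypothetical, and treating each benchmark as a lower bound for its label is an assumption made for this example.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardised mean difference based on the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def hedges_g(d, n_t, n_c):
    """Shrink d with the approximate correction J = 1 - 3 / (4 * df - 1)."""
    df = n_t + n_c - 2
    return d * (1 - 3 / (4 * df - 1))

# Benchmarks from the table above, treated here as lower bounds (an assumption).
BENCHMARKS = {"between": (0.40, 0.70, 1.00), "within": (0.60, 1.00, 1.40)}

def label_effect(es, design="between"):
    """Return a verbal label for an effect size under the chosen design."""
    small, medium, large = BENCHMARKS[design]
    size = abs(es)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "below small"

# Hypothetical study: two groups of 10 learners each, invented scores.
d = cohens_d(75, 70, 10, 10, 10, 10)
g = hedges_g(d, 10, 10)
print(round(d, 3), round(g, 3), label_effect(g, "between"))  # 0.5 0.479 small
```

With only 20 participants the correction lowers the estimate by about 4%; the same d of 0.5 would fall below the "small" threshold if it came from a within-group design.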
Heterogeneity: the degree to which effect sizes vary across studies beyond sampling error. Measured by Cochran's Q (a chi-squared test of whether the observed variability exceeds what sampling error alone would produce) and I² (the percentage of total variability attributable to true between-study differences rather than sampling error; conventionally 25% is low, 50% moderate, 75% high). High heterogeneity suggests that the studies are not estimating a single common effect, which calls for moderator analysis.
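The two heterogeneity statistics can be computed directly from each study's effect size and sampling variance. The following is a minimal sketch using inverse-variance weights; the function name and the three-study data set are invented for illustration.

```python
def heterogeneity(effects, variances):
    """Return the fixed-effect mean, Cochran's Q, its df, and I-squared (%)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of Q in excess of its expected value (df), floored at 0.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, df, i2

# Hypothetical effect sizes and sampling variances for three studies.
pooled, q, df, i2 = heterogeneity([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(round(pooled, 2), round(q, 2), df, round(i2, 1))  # 0.5 4.5 2 55.6
```

Q is referred to a chi-squared distribution with k − 1 degrees of freedom to obtain a p-value; that step is omitted here to keep the example dependency-free.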
Moderator analysis: investigating variables that explain heterogeneity. Subgroup analysis for categorical moderators (e.g., instructional setting, feedback type); meta-regression for continuous ones (e.g., treatment duration, sample size).
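For a categorical moderator, one common approach partitions the total Q into a within-subgroup part and a between-subgroup part; a large between-subgroup Q suggests the moderator explains some of the heterogeneity. The sketch below does this under a fixed-effect model, which is a simplifying assumption. The subgroup labels and all numbers are invented.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted mean and Cochran's Q for one set of studies."""
    weights = [1 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))
    return mean, q

def subgroup_analysis(studies):
    """studies: list of (subgroup_label, effect, variance) tuples."""
    groups = {}
    for label, effect, variance in studies:
        groups.setdefault(label, []).append((effect, variance))
    _, q_total = fixed_effect([e for _, e, _ in studies], [v for _, _, v in studies])
    means, q_within = {}, 0.0
    for label, rows in groups.items():
        mean, q = fixed_effect([e for e, _ in rows], [v for _, v in rows])
        means[label] = mean
        q_within += q
    # Heterogeneity not accounted for within subgroups is attributed to the moderator.
    q_between = q_total - q_within
    return means, q_between, len(groups) - 1

# Hypothetical data: instructional setting as the moderator.
studies = [
    ("classroom", 0.8, 0.05), ("classroom", 1.0, 0.05),
    ("laboratory", 0.4, 0.05), ("laboratory", 0.6, 0.05),
]
means, q_between, df_between = subgroup_analysis(studies)
print(means, round(q_between, 2), df_between)
```

In this invented example the subgroup means are 0.9 and 0.5, and Q-between is 3.2 on 1 degree of freedom. Meta-regression for continuous moderators follows the same logic but fits a weighted regression instead of comparing group means.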
Forest plot: visual display of each study's effect size and confidence interval, plus the overall weighted mean (the diamond at the bottom). Overlapping CIs suggest low heterogeneity.
Funnel plot: a scatter of effect sizes against a measure of precision (typically the standard error). It should form a symmetrical inverted funnel if no bias exists. Asymmetry suggests publication bias: studies with null results are missing from the literature.
Publication bias (the "file drawer problem"): the tendency for significant/large effects to get published while null results stay unpublished, inflating the aggregate. Detected via funnel plots, Egger's regression test, or trim-and-fill analysis.
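Egger's test formalises funnel-plot asymmetry by regressing each study's standardised effect (effect divided by its standard error) on its precision (one divided by the standard error); an intercept far from zero points to small-study effects. The sketch below fits that regression with ordinary least squares in plain Python. The function name and the data are hypothetical, and the t statistic would still need to be compared against a t distribution with k − 2 degrees of freedom.

```python
import math

def egger_test(effects, std_errors):
    """Return (intercept, slope, t statistic for the intercept)."""
    y = [e / se for e, se in zip(effects, std_errors)]  # standardised effects
    x = [1 / se for se in std_errors]                   # precision
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance and the standard error of the intercept.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_intercept = math.sqrt(s2 * (1 / n + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, t_stat

# Hypothetical data in which less precise studies report larger effects.
effects = [0.95, 0.70, 0.62, 0.45, 0.41, 0.38]
std_errors = [0.40, 0.30, 0.25, 0.15, 0.10, 0.08]
intercept, slope, t_stat = egger_test(effects, std_errors)
print(round(intercept, 2), round(slope, 2), round(t_stat, 2))
```

The test has low power when only a handful of studies are available, so a non-significant result does not rule out publication bias.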
Steps in a Meta-Analysis
Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines:
- Define the research question: specify the construct, population, and outcome types. Develop inclusion/exclusion criteria.
- Systematic search: multiple databases (ERIC, LLBA, PsycINFO, Web of Science) plus grey literature (dissertations, conference papers) to reduce publication bias.
- Screening: typically 2–3 independent reviewers. Report via PRISMA flow diagram. Calculate inter-rater reliability.
- Coding: classify study features such as research design (between-group vs. within-group), sample characteristics, treatment type, duration, and outcome measure type. These coded features become the moderator variables.
- Effect size calculation: compute effect sizes, or extract them, from reported means, SDs, t-values, F-values, or p-values. Choose between a fixed-effects model (assumes one true effect) and a random-effects model (assumes true effects vary across studies; more common and more appropriate in SLA).
- Analysis: weighted mean effect size, heterogeneity tests, moderator analyses, publication bias assessment.
- Reporting: aggregate effects with confidence intervals, forest and funnel plots, moderator results, limitations.
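The effect size and analysis steps above can be sketched for the random-effects case. The example below uses the DerSimonian-Laird estimator of between-study variance (tau²), one widely used choice among several; the function name and input data are hypothetical.

```python
import math

def random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling with a 95% confidence interval."""
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum_w
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    # Method-of-moments estimate of between-study variance, floored at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau2 to each study's sampling variance.
    w_re = [1 / (v + tau2) for v in variances]
    mean = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {"fixed": fixed, "tau2": tau2, "mean": mean, "se": se,
            "ci": (mean - 1.96 * se, mean + 1.96 * se)}

# Hypothetical effect sizes and sampling variances for three studies.
result = random_effects([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(result)
```

For these invented inputs tau² is 0.05 and the pooled mean stays at 0.5, but the confidence interval is wider than a fixed-effects interval would be, because between-study variance is added to every study's weight denominator.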
The Appeal
- Larger combined sample sizes than any single study
- Can reveal patterns invisible in individual studies
- Systematic, replicable procedure
- Carries high authority in academic and policy debates
The Pitfalls
The Bryfonski-McKay controversy and critiques of Norris & Ortega (2000) illustrate recurring problems:
1. Garbage in, garbage out
If primary studies have methodological flaws (no pre-test, vague control groups, practice-test congruency), aggregating their effect sizes produces a precise-looking but misleading number. Statistical sophistication in the meta-analytic procedure cannot compensate for weak primary research.
2. Construct delineation ("apples and oranges")
The variable of interest must be clearly defined. "TBLT" can mean genuine task-based teaching or task-supported instruction bolted onto a grammar syllabus. "Explicit instruction" can mean PPP, structured input, corrective feedback, or metalinguistic explanation. Lumping dissimilar treatments under one label obscures more than it reveals.
3. Design conflation
Mixing effect sizes from between-group and within-group designs tends to inflate the aggregate. Within-group (pre-post) designs capture all change between pre-test and post-test, including maturation and practice effects, not just the effect of the treatment. This is why Plonsky & Oswald (2014) proposed separate benchmarks for the two designs.
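A small worked example makes the inflation concrete. The sketch below compares a raw pre-post effect size for a treatment group with a version that subtracts the control group's gain. All scores are invented, and standardising by the pre-test standard deviation is only one of several conventions.

```python
def prepost_d(pre_mean, post_mean, pre_sd):
    """Within-group effect size: raw gain standardised by the pre-test SD."""
    return (post_mean - pre_mean) / pre_sd

def controlled_gain_d(t_pre, t_post, c_pre, c_post, pre_sd):
    """Treatment gain minus control gain, standardised by the pre-test SD."""
    return ((t_post - t_pre) - (c_post - c_pre)) / pre_sd

# Hypothetical scores: both groups start at 50 with a pre-test SD of 10.
within = prepost_d(50, 62, 10)                     # treatment group alone
controlled = controlled_gain_d(50, 62, 50, 55, 10) # control group also gained 5
print(round(within, 2), round(controlled, 2))  # 1.2 0.7
```

The control group's five-point gain stands in for maturation and practice effects, so a pooled average that treats 1.2 and 0.7 as the same kind of quantity would overstate what the treatment contributed.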
4. Data quality and inclusion criteria
Shin (2010) showed that Norris & Ortega (2000) used no quality criteria for selecting primary studies, adopting an "inclusive approach" that admitted studies regardless of methodological rigour. Specific problems:
- Randomization: 47% (23/49) of data samples did not report random assignment; pretests were often missing or showed significant group differences (e.g., Scott, 1989)
- Small samples: 28.2% had N ≤ 10; 88.5% had N ≤ 30, which is too small for random assignment to make group equivalence likely
- Instrument validity: Some studies used tests with as few as 4–5 target items; others recycled pretest items on the posttest
- Coding consistency: Several studies reported no inter-rater agreement for coding think-aloud or verbal protocol data
Subsequent meta-analyses (7 reviewed by Shin) perpetuated these problems; all 7 adopted the same inclusive approach with little attention to data quality.
5. Outcome measure bias
Norris & Ortega (2000) has also been critiqued (Shin, 2010) because most included studies used outcome measures biased toward explicit knowledge (grammar tests, metalinguistic judgments). If you test explicit knowledge, explicit instruction wins, but that doesn't tell you about communicative ability.
6. Premature synthesis
When the primary research base is too small or too heterogeneous, meta-analysis creates a false impression of robust evidence. Boers & Faez (2023) concluded that the TBLT field is simply "not ripe for such a meta-analysis."
Major SLA Meta-Analyses
| Meta-Analysis | Claim | Critique |
|---|---|---|
| Norris & Ortega (2000) | Explicit instruction more effective than implicit (d ≈ 0.96) | No quality criteria for inclusion; 47% lacked randomization; used Cohen's d rather than Hedges' g (leaving small-sample bias uncorrected); oversimplified FonF/FonFS coding; outcome measures biased toward explicit knowledge (Shin, 2010) |
| Li (2010) | Medium overall effect for CF (d = 0.61–0.64) | Conflated within-group and between-group designs (flagged by Boers et al., 2021) |
| Lyster & Saito (2010) | CF effective and durable; prompts > recasts | 15 classroom studies (N = 827); more focused but smaller base |
| Plonsky & Oswald (2014) | Field-specific benchmarks from 91 meta-analyses | Methodological, not topical; redefined how SLA interprets effect sizes |
| Bryfonski & McKay (2019) | Large effect for TBLT (d = 0.93) | 51 of 52 studies failed rigorous screening; task-based/task-supported conflation |
| Xuan et al. (2022) | Recalculated TBLT effect: g = 0.61 | Better screening but still includes task-supported studies |
Key Methodologists
- Luke Plonsky: Northern Arizona University. Leading SLA meta-analysis methodologist. Plonsky & Oswald (2014) on field-specific benchmarks; co-author of An A–Z of Applied Linguistics Research Methods (2016) with Loewen.
- John Norris: ETS. Co-author of the landmark 2000 meta-analysis. Also known for task-based language assessment.
- Lourdes Ortega: Georgetown University. Co-author of Norris & Ortega (2000), Pimsleur Award winner. Author of Understanding Second Language Acquisition (2009). Champions the "bilingual turn" in SLA.
- Shawn Loewen: Michigan State University. Instructed SLA, quantitative research methodology. Co-author with Plonsky on research methods.
Reading Meta-Analyses Critically
Questions to ask:
- How is the construct defined? Would all researchers agree these studies measure the same thing?
- What designs were included? Are between-group and within-group studies pooled? Were experimental and quasi-experimental results separated?
- What were the inclusion criteria? Were quality indicators applied (randomization, instrument validity, coding consistency)? Or was an "inclusive approach" used?
- What do the outcome measures test? Explicit knowledge? Implicit knowledge? Communicative ability?
- Is the primary research base sufficient? Enough studies of adequate quality to warrant synthesis?
- What does the funnel plot look like? Is there evidence of publication bias? Was a sensitivity analysis conducted?
- How heterogeneous are the results? What does I² tell you?
- Which effect size statistic was used? Cohen's d or Hedges' adjusted g? With small samples (N < 20), Cohen's d and unadjusted Hedges' g inflate the estimate.
- Were moderating variables coded? Target language, learner proficiency, age, ESL vs EFL context, and treatment duration; failure to code these can mask meaningful differences (Shin, 2010).