Between-Group Study

SLA, Research Methodology, between-group design, between-subjects design, comparison study

A between-group study (also called a between-subjects design) compares two or more groups that receive different treatments. In second language acquisition (SLA) research, this typically means a treatment group (e.g., receiving task-based language teaching, TBLT) compared with a control or comparison group (e.g., receiving traditional instruction or no instruction). Each participant is in only one condition.

Why It Matters

Between-group designs are the standard for causal claims about instructional effectiveness. The logic: both groups experience the same passage of time, the same maturation, and the same external factors. Provided the groups were comparable at the outset, the only systematic difference between them is the treatment, so post-treatment differences can be attributed to the treatment itself.

Between-Group vs. Within-Group

Feature	Between-group	Within-group (pre-post)
Groups	Two or more separate groups	Same group tested twice
What it measures	Treatment-specific effect	Total change (all causes)
Causal logic	Stronger: comparison group controls for non-treatment factors	Weaker: cannot separate treatment from maturation, practice effects, regression to the mean
Effect size magnitude	Generally smaller	Generally larger
Effect size benchmarks for d, small / medium / large (Plonsky & Oswald, 2014)	0.40 / 0.70 / 1.00	0.60 / 1.00 / 1.40

Within-group designs cannot rule out: natural maturation, practice effects from repeated testing, regression to the mean (especially if learners were selected for low performance), or the Hawthorne effect.
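
Because the benchmarks in the table differ by design, the same d value can earn a different label depending on where it came from. The sketch below illustrates this. The function name `interpret_d` and the choice to treat each benchmark as a lower bound are assumptions made for illustration, not part of Plonsky & Oswald's text.

```python
# Benchmarks for d from the table above (Plonsky & Oswald, 2014),
# stored as (small, medium, large) for each design.
BENCHMARKS = {
    "between": (0.40, 0.70, 1.00),
    "within": (0.60, 1.00, 1.40),
}


def interpret_d(d: float, design: str) -> str:
    """Label |d| against the benchmarks for the given design.

    Assumption: each benchmark is treated as a lower bound, so a value
    counts as "medium" once it reaches the medium benchmark.
    """
    small, medium, large = BENCHMARKS[design]
    size = abs(d)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "below small"


# The same value reads as "medium" between groups but only "small" within.
for design in ("between", "within"):
    print(design, interpret_d(0.75, design))
```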

Requirements for a Sound Between-Group Study

Pre-test for both groups: establishes baseline equivalence. Without this, post-treatment differences might reflect pre-existing ability gaps, not the treatment.
Random assignment (ideal) or demonstrated equivalence: ensures groups are comparable. True randomisation is the gold standard because it probabilistically equalises all confounding variables.
Clear description of both groups: you need to know what each group actually did. A vague "traditional instruction" label for the control group makes interpretation impossible.
Same outcome measure: both groups assessed with the same instrument under the same conditions.
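
The first two requirements can be sketched in a few lines: assign learners at random, then check how far apart the groups are on the pre-test. This is a minimal sketch using only the standard library. The learner IDs, scores, and function names are hypothetical, and the standardised pre-test difference is just one possible equivalence check.

```python
import random
from statistics import mean, stdev


def assign_randomly(participants, seed=None):
    """Shuffle participants and split them into two near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def baseline_difference(pre_a, pre_b):
    """Standardised pre-test mean difference, using the pooled SD."""
    n_a, n_b = len(pre_a), len(pre_b)
    pooled_var = (
        (n_a - 1) * stdev(pre_a) ** 2 + (n_b - 1) * stdev(pre_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(pre_a) - mean(pre_b)) / pooled_var ** 0.5


# Hypothetical pre-test scores keyed by learner ID (illustrative only).
scores = [52, 61, 47, 58, 66, 49, 55, 63, 51, 60, 57, 54]
pre_scores = {f"S{i:02d}": s for i, s in enumerate(scores, start=1)}

treatment_ids, control_ids = assign_randomly(pre_scores, seed=42)
gap = baseline_difference(
    [pre_scores[i] for i in treatment_ids],
    [pre_scores[i] for i in control_ids],
)
print(f"Baseline standardised difference: {gap:.2f}")
```

A value near zero suggests the groups started from a similar baseline. With intact classes the assignment step is unavailable, so the baseline check carries more of the weight.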

Effect Size Calculation

Posttest-only: d = (M_treatment − M_control) / SD_pooled

With pre-test adjustment (preferred; Morris, 2008): d = [(M_treatment_post − M_treatment_pre) − (M_control_post − M_control_pre)] / SD_pooled_pre

The pre-test-adjusted formula uses the pre-test SD as the standardiser (uncontaminated by the treatment) and controls for any baseline differences. The choice of denominator can shift d by approximately 0.3, potentially moving an effect from "small" to "medium."
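
The two formulas above can be written as a short sketch. The function names and score lists are hypothetical. The code follows the formulas exactly as stated here, so it omits the small-sample bias correction that Morris (2008) also describes.

```python
from statistics import mean, stdev


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    n_a, n_b = len(a), len(b)
    var = (
        (n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2
    ) / (n_a + n_b - 2)
    return var ** 0.5


def d_posttest_only(treat_post, ctrl_post):
    """d = (M_treatment - M_control) / SD_pooled, post-test scores only."""
    return (mean(treat_post) - mean(ctrl_post)) / pooled_sd(treat_post, ctrl_post)


def d_pretest_adjusted(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Difference in mean gains, standardised by the pooled pre-test SD."""
    gain_treat = mean(treat_post) - mean(treat_pre)
    gain_ctrl = mean(ctrl_post) - mean(ctrl_pre)
    return (gain_treat - gain_ctrl) / pooled_sd(treat_pre, ctrl_pre)


# Hypothetical scores, invented for illustration only.
treat_pre = [50, 55, 60, 45, 52, 58]
treat_post = [62, 66, 71, 58, 63, 70]
ctrl_pre = [48, 53, 57, 44, 50, 54]
ctrl_post = [53, 57, 62, 49, 54, 60]

print(f"posttest-only d:     {d_posttest_only(treat_post, ctrl_post):.2f}")
print(f"pre-test-adjusted d: "
      f"{d_pretest_adjusted(treat_pre, treat_post, ctrl_pre, ctrl_post):.2f}")
```

Running both functions on the same data shows how much the result depends on the standardiser and on whether baseline differences are subtracted out.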

The Problem of Quasi-Experimental Designs

Most SLA classroom research uses intact classes rather than true random assignment, which makes these studies quasi-experimental rather than truly experimental. Only an estimated 16–25% of experimental SLA/CALL studies use true random assignment. Reasons:

Institutions assign students to classes; researchers cannot reshuffle them
Ethical concerns about denying instruction to a control group
Practical constraints of timetables and administration

With intact groups, systematic differences between classes (different teachers, time slots, motivation levels) become potential confounds. This makes pre-testing, transparent comparison group descriptions, and careful equivalence checks especially critical.

The Problem of Design Conflation

Pooling effect sizes from between-group and within-group studies in a single meta-analysis inflates the overall estimate. Within-group d values are inherently larger because they capture all change, not just treatment-specific effects. This was a core problem in the Bryfonski-McKay controversy: of 52 studies, only 27 were between-group designs, and even among those, many lacked pre-tests or adequate control group descriptions.
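
The inflation can be seen with a toy calculation. The effect sizes below are invented for illustration and do not come from any of the studies discussed. The sketch uses an unweighted mean, which is a simplification, since a real meta-analysis would typically weight studies (for example by inverse variance).

```python
from statistics import mean

# Hypothetical (design, d) pairs, invented for illustration only.
studies = [
    ("between", 0.45), ("between", 0.60), ("between", 0.52),
    ("within", 1.10), ("within", 1.35), ("within", 0.95),
]


def mean_d(rows, design=None):
    """Unweighted mean d, optionally restricted to one design."""
    values = [d for kind, d in rows if design is None or kind == design]
    return mean(values)


print(f"pooled across designs: {mean_d(studies):.2f}")
print(f"between-group only:    {mean_d(studies, 'between'):.2f}")
print(f"within-group only:     {mean_d(studies, 'within'):.2f}")
```

Reporting the two designs separately keeps the larger within-group values from pulling the aggregate above what the between-group comparisons support.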

Other Threats to Validity

Attrition: participants dropping out differentially between groups can destroy equivalence. If weaker students drop out of the treatment group, the remaining group looks artificially better.
Practice-test congruency: if the treatment group practises activities similar to the post-test but the control group does not, better performance may reflect test familiarity rather than genuine learning.
Teacher effects: if different teachers teach each group, any differences might reflect the teacher rather than the treatment.

In the TBLT Meta-Analysis Debate

Source	Effect Size	Note
Bryfonski & McKay (2019)	d = 0.93	From 27 between-group studies, but many lacked pre-tests
Xuan et al. (2022)	g = 0.61	Recalculated from 16 better-screened studies
Boers et al. (2021)	Only 1 of 27 survived	Rigorous screening left a single usable study

The takeaway for consumers of meta-analyses: always check whether the aggregate was calculated from between-group comparisons, within-group comparisons, or a mix of both.

Related Terms