Learner Corpus
research-methodology, SLA
A learner corpus is a systematic, electronic collection of written or transcribed spoken texts produced by second or foreign language learners. Unlike native-speaker corpora, learner corpora document Interlanguage (the developing language systems of L2 users), enabling large-scale analysis of error patterns, developmental features, and the effects of L1 background, proficiency level, and task type on learner production.
Major Learner Corpora
| Corpus | Full name | Size | Features |
|---|---|---|---|
| ICLE | International Corpus of Learner English | ~5.7 million words | Argumentative essays from higher-intermediate to advanced university learners; 25 L1 backgrounds in version 3; directed by Sylviane Granger (UCLouvain), project launched in 1990 |
| EFCAMDAT | EF-Cambridge Open Language Database | ~83 million words | Written submissions from EF online learners; 180+ nationalities; all proficiency levels |
| CLC | Cambridge Learner Corpus | ~55 million words | Exam scripts from Cambridge English exams; error-tagged |
| LINDSEI | Louvain International Database of Spoken English Interlanguage | ~1 million words | Spoken interviews from advanced learners; 11 L1 backgrounds |
| ICNALE | International Corpus Network of Asian Learners of English | ~2.3 million words | Written and spoken data from 10 Asian countries |
What Learner Corpora Enable
- Error Analysis at scale: identifying systematic error patterns across thousands of learners, moving beyond individual case studies
- L1 influence: comparing how learners from different L1 backgrounds produce the same target features (Language Transfer, Crosslinguistic Influence)
- Developmental profiling: tracking which features appear at which proficiency levels
- Overuse/underuse analysis: comparing normalized frequencies in learner and native-speaker corpora to identify what learners produce too much or too little of
- Data-driven learning: using learner corpus findings to inform teaching materials and giving learners corpus evidence to explore directly (see Concordance Lines)
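An overuse/underuse comparison can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes a naive whitespace tokenizer, two invented sample texts, and the two-corpus log-likelihood statistic that is widely used for frequency comparison in corpus linguistics. None of these choices comes from a specific learner corpus study, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real studies use proper tokenizers)."""
    return text.lower().split()


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one item; a zero count contributes nothing."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll


def compare_item(item, learner_tokens, reference_tokens):
    """Compare one item's per-million frequency in learner vs reference data."""
    learner_freq = Counter(learner_tokens)[item]
    reference_freq = Counter(reference_tokens)[item]
    learner_size = len(learner_tokens)
    reference_size = len(reference_tokens)
    # Normalizing per million words makes corpora of different sizes comparable.
    learner_pm = 1_000_000 * learner_freq / learner_size
    reference_pm = 1_000_000 * reference_freq / reference_size
    if learner_pm > reference_pm:
        direction = "overuse"
    elif learner_pm < reference_pm:
        direction = "underuse"
    else:
        direction = "equal"
    return {
        "item": item,
        "learner_per_million": learner_pm,
        "reference_per_million": reference_pm,
        "direction": direction,
        "log_likelihood": log_likelihood(
            learner_freq, learner_size, reference_freq, reference_size
        ),
    }


# Invented toy samples, far too small for real conclusions.
LEARNER_SAMPLE = "i think that it is very good and i think it is very nice"
REFERENCE_SAMPLE = "it seems that the result is good and it is rather clear"

result = compare_item("very", tokenize(LEARNER_SAMPLE), tokenize(REFERENCE_SAMPLE))
print(result["direction"], round(result["log_likelihood"], 2))
```

With real data the same comparison would run over every item in the vocabulary, and the task-effect and comparison-corpus issues listed under Concerns apply to how the two samples are chosen.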
Methodology
Learner corpus research follows the principles of Corpus Linguistics but adds learner-specific metadata and annotation:
- L1 background: essential for cross-linguistic comparison
- Proficiency level: self-reported, test-based, or institutionally assigned
- Task type: essay, letter, narrative, spoken interview
- Learning context: EFL/ESL, instructed/naturalistic, immersion
- Error annotation: some corpora (e.g. CLC) include error tagging; others leave this to researchers
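One way to picture this metadata is as a record attached to every text, which can then be grouped by any variable of interest. The following is a minimal sketch under stated assumptions: the `LearnerText` class, its field names, the error tag labels, and the sample sentences are all hypothetical and do not reproduce the scheme of CLC or any other corpus.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class LearnerText:
    """One corpus text plus learner-specific metadata (hypothetical schema)."""
    text: str
    l1: str
    proficiency: str
    task_type: str
    learning_context: str
    error_tags: List[str] = field(default_factory=list)  # invented tag labels


def errors_per_100_words(records, group_by):
    """Tagged errors per 100 words, grouped by a metadata field name."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for record in records:
        key = getattr(record, group_by)
        errors[key] += len(record.error_tags)
        words[key] += len(record.text.split())
    return {key: 100 * errors[key] / words[key] for key in words if words[key] > 0}


# Invented sample records for illustration only.
RECORDS = [
    LearnerText("I am agree with this opinion", "Spanish", "B1", "essay", "EFL", ["verb_form"]),
    LearnerText("She go to school every day", "Spanish", "B1", "narrative", "EFL", ["agreement"]),
    LearnerText("He gave me many advices yesterday", "French", "B2", "letter", "ESL", ["countability"]),
    LearnerText("The results clearly support this claim", "French", "B2", "essay", "ESL"),
]

print(errors_per_100_words(RECORDS, "l1"))
print(errors_per_100_words(RECORDS, "proficiency"))
```

Grouping by `l1` supports cross-linguistic comparison, while grouping by `proficiency` gives a crude developmental profile. In this toy data the two variables are fully confounded (all Spanish texts are B1 and all French texts are B2), which is the kind of composition problem that corpus design needs to control.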
Concerns
- Proficiency assignment: how reliably are learners classified into levels?
- Task effects: different tasks elicit different language; corpus composition affects findings
- Representativeness: most corpora over-represent university students from certain L1 backgrounds
- Comparison corpus: what counts as the "native speaker" baseline? The native-speaker norm itself is contested in the age of English as a lingua franca (ELF)
- Access: some major corpora are restricted or require licensing
Key References
- Granger (2004): Computer learner corpus research: current status and future prospects
- Granger, Dagneaux, Meunier & Paquot (2009): ICLE
- Geertzen, Alexopoulou & Korhonen (2013): EFCAMDAT
- Gilquin & Granger (2015): learner corpus research in SLA
- Meunier & Granger (2008): Phraseology in Foreign Language Learning and Teaching