Learner Corpus
Research Methodology, SLA
A learner corpus is a systematic, electronic collection of written or transcribed spoken texts produced by second or foreign language learners. Unlike native-speaker corpora, learner corpora document Interlanguage, the developing language system of L2 users. This enables large-scale analysis of error patterns and developmental features, and of how L1 background, proficiency level, and task type affect learner production.
Major Learner Corpora
| Corpus | Full name | Size | Features |
|---|---|---|---|
| ICLE | International Corpus of Learner English | ~5.7 million words | Argumentative essays from higher-intermediate to advanced university learners; 25+ L1 backgrounds; directed by Sylviane Granger (UCLouvain) since 1990 |
| EFCAMDAT | EF-Cambridge Open Language Database | ~83 million words | Written submissions from EF online learners; 180+ nationalities; all proficiency levels |
| CLC | Cambridge Learner Corpus | ~55 million words | Exam scripts from Cambridge English exams; error-tagged |
| LINDSEI | Louvain International Database of Spoken English Interlanguage | ~1 million words | Spoken interviews from advanced learners; 11 L1 backgrounds |
| ICNALE | International Corpus Network of Asian Learners of English | ~2.3 million words | Written and spoken data from 10 Asian countries and regions |
What Learner Corpora Enable
- Error Analysis at scale: identifying systematic error patterns across thousands of learners, moving beyond individual case studies
- L1 influence: comparing how learners from different L1 backgrounds produce the same target features (Language Transfer, Crosslinguistic Influence)
- Developmental profiling: tracking which features appear at which proficiency levels
- Overuse/underuse analysis: comparing learner and native-speaker corpora to identify features that learners use more or less often than the reference group
- Data-driven learning: using learner corpus findings to inform teaching materials (see Concordance Lines)
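The overuse/underuse comparison listed above is usually quantified with a keyness statistic. The following is a minimal sketch, assuming the log-likelihood (G2) measure that is widely used in corpus linguistics; this entry does not prescribe a particular statistic, and the function names are invented for illustration.

```python
import math


def log_likelihood(freq_learner, size_learner, freq_ref, size_ref):
    """Log-likelihood (G2) score for one item observed in two corpora.

    freq_* are raw counts of the item; size_* are corpus sizes in tokens.
    """
    total_freq = freq_learner + freq_ref
    total_size = size_learner + size_ref
    # Expected counts if the item were equally frequent in both corpora.
    expected_learner = size_learner * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    score = 0.0
    for observed, expected in (
        (freq_learner, expected_learner),
        (freq_ref, expected_ref),
    ):
        # A zero count contributes nothing (limit of x * ln x as x -> 0).
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


def usage_direction(freq_learner, size_learner, freq_ref, size_ref):
    """Label the learner corpus as overusing or underusing the item."""
    rate_learner = freq_learner / size_learner
    rate_ref = freq_ref / size_ref
    if rate_learner > rate_ref:
        return "overuse"
    if rate_learner < rate_ref:
        return "underuse"
    return "no difference"
```

As a hypothetical use, `log_likelihood(150, 100_000, 100, 100_000)` compares an item found 150 times in 100,000 learner words with 100 occurrences in a reference corpus of the same size. Scores above 3.84 are conventionally read as significant at p < 0.05 (one degree of freedom), although the score alone says nothing about how evenly the item is spread across individual learners.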
Methodology
Learner corpus research follows the principles of Corpus Linguistics but adds learner-specific metadata:
- L1 background: essential for cross-linguistic comparison
- Proficiency level: self-reported, test-based, or institutionally assigned
- Task type: essay, letter, narrative, spoken interview
- Learning context: EFL/ESL, instructed/naturalistic, immersion
- Error annotation: some corpora (e.g. CLC) include error tagging; others leave this to researchers
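The metadata fields listed above can be stored with each text so that frequencies are compared across groups rather than across raw totals. This is a sketch under stated assumptions: `LearnerText`, `rate_per_thousand`, and the field names are hypothetical and do not reproduce the schema of any corpus named in this entry, and whitespace splitting stands in for a real tokenizer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerText:
    """One corpus text plus learner-specific metadata (hypothetical schema)."""
    text_id: str
    l1: str            # first language of the writer or speaker
    proficiency: str   # e.g. a CEFR label such as "B1"
    task_type: str     # essay, letter, narrative, spoken interview
    context: str       # e.g. "EFL" or "ESL"
    text: str


def rate_per_thousand(texts, target, group_key):
    """Frequency of `target` per 1,000 tokens, grouped by a metadata field.

    `group_key` is a function that maps a LearnerText to its group label.
    """
    hits = defaultdict(int)
    tokens = defaultdict(int)
    for item in texts:
        # Naive tokenization: lowercase and split on whitespace.
        words = item.text.lower().split()
        key = group_key(item)
        tokens[key] += len(words)
        hits[key] += words.count(target.lower())
    # Normalize so that groups of different sizes are comparable.
    return {key: 1000 * hits[key] / tokens[key] for key in tokens if tokens[key]}
```

Passing `lambda t: t.proficiency` as `group_key` gives a simple developmental profile, while `lambda t: t.l1` gives an L1 comparison. Normalizing per 1,000 tokens matters because subcorpora for different L1s or levels are rarely the same size.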
Concerns
- Proficiency assignment: how reliably are learners classified into levels?
- Task effects: different tasks elicit different language; corpus composition affects findings
- Representativeness: most corpora over-represent university students from certain L1 backgrounds
- Comparison corpus: what counts as the "native speaker" baseline? The native-speaker norm itself is contested in the age of English as a Lingua Franca (ELF)
- Access: some major corpora are restricted or require licensing
Key References
- Granger (2002): learner corpus research, current status and future prospects
- Granger, Dagneaux, Meunier & Paquot (2009): ICLE
- Geertzen, Alexopoulou & Korhonen (2013): EFCAMDAT
- Gilquin & Granger (2015): learner corpus research in SLA
- Meunier & Granger (2008): Phraseology in Foreign Language Learning and Teaching