Learner Corpus

Research Methodology, SLA

A learner corpus is a systematic, electronic collection of texts produced by second or foreign language learners. Unlike native-speaker corpora, learner corpora document Interlanguage, the developing language systems of L2 users, enabling large-scale analysis of error patterns, developmental features, and the effects of L1 background, proficiency level, and task type on learner production.

Major Learner Corpora

Corpus	Full name	Size	Features
ICLE	International Corpus of Learner English	~5.7 million words	Argumentative essays from advanced university learners; 25+ L1 backgrounds; directed by Sylviane Granger (UCLouvain, since 1990)
EFCAMDAT	EF-Cambridge Open Language Database	~83 million words	Written submissions from EF online learners; 180+ nationalities; all proficiency levels
CLC	Cambridge Learner Corpus	~55 million words	Exam scripts from Cambridge English exams; error-tagged
LINDSEI	Louvain International Database of Spoken English Interlanguage	~1 million words	Spoken interviews from advanced learners; 11 L1 backgrounds
ICNALE	International Corpus Network of Asian Learners of English	~2.3 million words	Written and spoken data from 10 Asian countries

What Learner Corpora Enable

Error Analysis at scale: identifying systematic error patterns across thousands of learners, moving beyond individual case studies
L1 influence: comparing how learners from different L1 backgrounds produce the same target features (Language Transfer, Crosslinguistic Influence)
Developmental profiling: tracking which features appear at which proficiency levels
Overuse/underuse analysis: comparing learner and native-speaker corpora to identify features that learners use more or less often than native speakers do
Data-driven learning: using learner corpus findings to inform teaching materials (see Concordance Lines)
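
The overuse/underuse comparison listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure of any particular corpus project: it compares a word's relative frequency in a learner sample and a reference sample, and scores the difference with the log-likelihood statistic (G2) that is commonly used for corpus frequency comparison. The function names, the simple regex tokenizer, and the toy token lists are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (an assumption): lowercase alphabetic strings only."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, n1, n2):
    """G2 for an item seen a times in corpus 1 (n1 tokens) and b times in corpus 2 (n2 tokens)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("both corpora must contain at least one token")
    # Expected counts if the item were equally frequent in both corpora.
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def compare_word(word, learner_tokens, reference_tokens):
    """Report normalized frequencies, direction of difference, and G2 for one word."""
    n1, n2 = len(learner_tokens), len(reference_tokens)
    a = Counter(learner_tokens)[word]
    b = Counter(reference_tokens)[word]
    rel1, rel2 = a / n1, b / n2
    if rel1 > rel2:
        direction = "overuse"
    elif rel1 < rel2:
        direction = "underuse"
    else:
        direction = "no difference"
    return {
        "learner_per_million": rel1 * 1_000_000,
        "reference_per_million": rel2 * 1_000_000,
        "direction": direction,
        "log_likelihood": log_likelihood(a, b, n1, n2),
    }


# Toy data (invented): "very" occurs 30 times per 1,000 learner tokens
# and 10 times per 1,000 reference tokens.
learner = ["very"] * 30 + ["filler"] * 970
reference = ["very"] * 10 + ["filler"] * 990
result = compare_word("very", learner, reference)
print(result["direction"], round(result["log_likelihood"], 2))
```

A G2 value above 3.84 corresponds to p < 0.05 at one degree of freedom. With real corpora, the comparison is only as meaningful as the match between the two samples in task type and text length.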

Methodology

Learner corpus research follows the principles of Corpus Linguistics but adds learner-specific metadata:

L1 background: essential for cross-linguistic comparison
Proficiency level: self-reported, test-based, or institutionally assigned
Task type: essay, letter, narrative, spoken interview
Learning context: EFL/ESL, instructed/naturalistic, immersion
Error annotation: some corpora (e.g., CLC) include error tagging; others leave this to researchers
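
One way to picture the metadata above is as a record attached to every text, which can then be used to select subcorpora. The sketch below is hypothetical: the field names, the `select` and `word_counts_by` helpers, and the three sample records are invented for illustration and do not reproduce the metadata scheme of ICLE, EFCAMDAT, or any other corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerText:
    """One learner production plus its metadata (hypothetical field names)."""
    text_id: str
    l1: str
    proficiency: str          # e.g. a CEFR band
    proficiency_source: str   # "self-reported", "test-based", or "institutional"
    task_type: str            # e.g. "essay", "letter", "narrative", "interview"
    learning_context: str     # e.g. "EFL" or "ESL"
    text: str
    error_annotated: bool = False


def select(texts, **criteria):
    """Return the texts whose metadata match every given field=value pair."""
    return [
        t for t in texts
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


def word_counts_by(texts, field):
    """Sum whitespace-delimited word counts per value of one metadata field."""
    totals = defaultdict(int)
    for t in texts:
        totals[getattr(t, field)] += len(t.text.split())
    return dict(totals)


# Invented sample records, for illustration only.
SAMPLE = [
    LearnerText("es-001", "Spanish", "B2", "test-based", "essay", "EFL",
                "I am agree with this opinion"),
    LearnerText("es-002", "Spanish", "B1", "self-reported", "letter", "EFL",
                "Dear Sir I write to you"),
    LearnerText("ja-001", "Japanese", "B2", "institutional", "essay", "EFL",
                "Many people thinks so"),
]

spanish_essays = select(SAMPLE, l1="Spanish", task_type="essay")
print([t.text_id for t in spanish_essays])
print(word_counts_by(SAMPLE, "l1"))
```

Recording where each proficiency label came from, as in the `proficiency_source` field here, keeps self-reported and test-based levels from being mixed silently, and a per-L1 word count gives a quick view of how balanced a sample is.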

Concerns

Proficiency assignment: how reliably are learners classified into levels?
Task effects: different tasks elicit different language; corpus composition affects findings
Representativeness: most corpora over-represent university students from certain L1 backgrounds
Comparison corpus: what counts as the "native speaker" baseline? The native-speaker norm itself is contested in the age of English as a Lingua Franca (ELF).
Access: some major corpora are restricted or require licensing

References

Granger (2002): learner corpus research, current status and future prospects
Granger, Dagneaux, Meunier & Paquot (2009): ICLE
Geertzen, Alexopoulou & Korhonen (2013): EFCAMDAT
Gilquin & Granger (2015): learner corpus research in SLA
Meunier & Granger (2008): Phraseology in Foreign Language Learning and Teaching

Related Terms