Corpus Linguistics
Corpus linguistics is the study of language through the systematic analysis of large, electronically stored collections of authentic text (corpora). It is usually regarded as a methodology rather than a theory: an empirical approach that has transformed how English is described and how language teaching materials are designed.
Core Concepts
- Corpus (pl. corpora): a principled collection of naturally occurring text, designed to be representative of a language variety. A corpus can be general (e.g., the British National Corpus, 100 million words) or specialised (e.g., a corpus of academic journal articles)
- Concordance: a display of every occurrence of a search term in its context, typically in KWIC (Key Word In Context) format — see Concordance Lines
- Frequency list: words ranked by how often they occur — the basis for vocabulary selection in teaching
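- Collocation: the tendency of words to co-occur more often than chance would predict (e.g., "strong tea", "heavy rain"), identified in corpus data with association measures such as mutual information or t-score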
- N-gram / cluster: recurring multi-word sequences (e.g., "on the other hand", "in terms of", "as a result of")
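The concepts above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular concordancer works. The function names, the crude tokenizer, and the sample text are all invented for this example. Pointwise mutual information (PMI) stands in here as one simple association measure for collocation; it is not itself a significance test, and real tools offer several alternatives.

```python
import re
from collections import Counter
from math import log2

def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def ngrams(tokens, n):
    """Count every contiguous n-word sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def kwic(tokens, keyword, width=4):
    """Return KWIC lines with `width` words of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

def pmi(tokens, w1, w2):
    """Pointwise mutual information for the adjacent pair (w1, w2)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pair = ngrams(tokens, 2)[(w1, w2)]
    if pair == 0:
        return float("-inf")  # the pair never occurs side by side
    p_pair = pair / (n - 1)   # n - 1 adjacent pairs in n tokens
    return log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

# Invented sample text, far too small for real conclusions.
text = ("On the other hand, the results were strong. In terms of cost, "
        "the plan was strong. On the other hand, cost matters.")
tokens = tokenize(text)

print(frequency_list(tokens)[:1])        # [('the', 4)]
print(ngrams(tokens, 4).most_common(1))  # [(('on', 'the', 'other', 'hand'), 2)]
for line in kwic(tokens, "hand"):
    print(line)
print(pmi(tokens, "other", "hand") > pmi(tokens, "the", "other"))  # True
```

On a text this small the numbers mean little; the point is only to show what each concept computes. PMI in particular is known to favour rare word pairs, which is one reason corpus tools usually report more than one measure.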
What Corpus Linguistics Reveals
Corpus analysis has overturned many intuition-based assumptions about English:
| Finding | Implication for ELT |
|---|---|
| The most frequent 2,000 word families account for roughly 80% of the running words in general text | Prioritise high-frequency vocabulary |
| Many "grammar rules" are probabilistic tendencies, not absolute rules | Teach patterns and typical use, not just rules |
| Spoken and written English differ systematically | Use appropriate models for each skill |
| Native-speaker intuitions about frequency and use are often wrong | Base materials on corpus evidence, not intuition |
| Language is highly formulaic: much of speech and writing consists of recurrent chunks | Teach chunks as well as single words, in line with the Lexical Approach and Formulaic Language teaching |
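The coverage figure in the first row is a simple proportion: the share of running words (tokens) accounted for by the most frequent items. A minimal sketch follows, assuming a pre-tokenized text and counting word forms rather than word families. Grouping inflected and derived forms into families would need a separate family list, which is not shown. The function name and the sample sentence are invented for illustration.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Share of tokens accounted for by the top_n most frequent word forms."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Invented toy sentence: 13 tokens, of which "the", "sat" and "on" make up 8.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(f"{coverage(tokens, 3):.0%}")  # 62%
```

Run over a large general corpus with `top_n` set to a few thousand, the same calculation produces the kind of coverage curve that vocabulary selection relies on.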
Key Corpora
| Corpus | Size | What it covers |
|---|---|---|
| British National Corpus (BNC) | 100M words | British English, 1990s, spoken + written |
| Corpus of Contemporary American English (COCA) | 1B+ words | American English, 1990–2019, multiple genres |
| Cambridge English Corpus | 2B+ words | Learner + native, informs Cambridge materials |
| Michigan Corpus of Academic Spoken English (MICASE) | 1.8M words | Academic speech |
| International Corpus of Learner English (ICLE) | 3.7M words (version 2) | L2 English writing from multiple L1 backgrounds |
Applications in Language Teaching
Materials Development
Corpus data helps ensure that textbooks present language as it is actually used. Frequency-based vocabulary lists such as Coxhead's Academic Word List and West's General Service List are derived from corpus frequency counts.
Data-Driven Learning (DDL)
Pioneered by Tim Johns (1991), DDL gives learners direct access to Concordance Lines and asks them to discover patterns inductively, making it a form of Guided Discovery. Learners examine authentic examples and formulate their own rules, developing analytical skills alongside language knowledge.
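One common way to make patterns visible to learners is to sort concordance lines by the words to the right of the search term, so that recurring complements line up. The sketch below assumes a handful of invented example sentences in place of real corpus output. The function name and the layout are hypothetical and do not reproduce any specific DDL tool.

```python
import re

def sorted_concordance(sentences, node, width=3):
    """Build KWIC lines for `node`, sorted by the right-hand context."""
    rows = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(words):
            if word == node:
                left = words[max(0, i - width):i]
                right = words[i + 1:i + 1 + width]
                rows.append((left, right))
    # Sorting on the right context groups identical following words together.
    rows.sort(key=lambda row: row[1])
    return [f"{' '.join(left):>25}  {node.upper()}  {' '.join(right)}"
            for left, right in rows]

# Invented sentences standing in for authentic corpus lines.
sentences = [
    "It depends on the weather.",
    "She depends on her parents.",
    "The price depends on demand.",
    "Success depends largely on effort.",
]
for line in sorted_concordance(sentences, "depends"):
    print(line)
```

With the lines sorted this way, learners can see at a glance that "depends" is overwhelmingly followed by "on", which is the kind of observation DDL tasks ask them to make for themselves.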
Grammar Description
Corpus-informed grammars (e.g., the Longman Grammar of Spoken and Written English; Biber et al., 1999) describe frequency, register variation, and typical patterns, information that is largely absent from traditional grammars.
Learner Corpus Research
Analysing corpora of learner writing reveals systematic errors, overuse and underuse of particular items relative to a reference corpus, and developmental patterns specific to L1 groups. This informs targeted teaching and assessment.
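Overuse and underuse are usually established by comparing an item's frequency in the learner corpus with its frequency in a reference corpus. One widely used statistic for this is log-likelihood. The sketch below uses the common two-term form of the calculation; the function names and the counts are invented for illustration, and the 3.84 threshold is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
from math import log

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood for one item's frequency in corpus A versus corpus B."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * log(observed / expected)
    return 2 * ll

def usage_direction(freq_learner, size_learner, freq_ref, size_ref,
                    threshold=3.84):
    """Label an item as overused, underused, or not significantly different."""
    ll = log_likelihood(freq_learner, size_learner, freq_ref, size_ref)
    if ll < threshold:
        return "no significant difference"
    learner_rate = freq_learner / size_learner
    ref_rate = freq_ref / size_ref
    return "overuse" if learner_rate > ref_rate else "underuse"

# Invented counts: an item occurs 120 times in a 100,000-word learner corpus
# and 60 times in a 100,000-word reference corpus.
print(round(log_likelihood(120, 100_000, 60, 100_000), 2))
print(usage_direction(120, 100_000, 60, 100_000))
```

A significant value only says the frequencies differ; deciding why learners overuse an item still requires reading the concordance lines.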
Limitations
- Corpora show what people say, not what is possible or desirable — descriptive, not prescriptive
- Representativeness is always a design decision — no corpus perfectly mirrors "the language"
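- Quantitative patterns can obscure qualitative differences: a raw frequency count does not distinguish between the different senses or functions of the same form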
- DDL requires training and is not equally effective with all learner populations