Corpus Linguistics

Language Analysis, Methodology

Corpus linguistics is the study of language through the systematic analysis of large, electronically stored collections of authentic text (corpora). It is fundamentally a methodology rather than a theory: an empirical approach that has transformed how English is described and how language teaching materials are designed.

Core Concepts

Corpus (pl. corpora): a principled collection of naturally occurring text, designed to be representative of a language variety; it can be general (e.g., the British National Corpus, 100 million words) or specialised (e.g., a corpus of academic journal articles)
Concordance: a display of every occurrence of a search term in its context, typically in KWIC (Key Word In Context) format; see Concordance Lines
Frequency list: words ranked by how often they occur, the basis for vocabulary selection in teaching
Collocation: statistically significant co-occurrence of words, revealed through corpus analysis
N-gram / cluster: recurring multi-word sequences ("on the other hand", "in terms of", "as a result of")
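
The frequency list, n-gram and collocation concepts above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular corpus tool works: the tokeniser is deliberately crude, the sample text is invented, and pointwise mutual information (PMI) stands in for the range of association measures used in practice (it scores strength of association and is not itself a significance test). All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pmi(tokens, w1, w2):
    """PMI for the adjacent pair (w1, w2): log2(f(w1 w2) * N / (f(w1) * f(w2)))."""
    unigrams = Counter(tokens)
    pair = ngrams(tokens, 2)[(w1, w2)]
    if pair == 0:
        return float("-inf")  # the pair never occurs in this sample
    return math.log2(pair * len(tokens) / (unigrams[w1] * unigrams[w2]))


# Invented sample text, far too small to be a real corpus.
sample = (
    "On the other hand, the results were mixed. "
    "On the other hand, the method was cheap. "
    "As a result of the test, the method changed."
)
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])        # top of the frequency list
print(ngrams(tokens, 4).most_common(1))  # most frequent 4-word cluster
print(round(pmi(tokens, "other", "hand"), 2))
```

On real data, PMI tends to favour rare pairs, so it is usually combined with a minimum frequency threshold or replaced by a measure such as log-likelihood or t-score.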

What Corpus Linguistics Reveals

Corpus analysis has overturned many intuition-based assumptions about English:

Finding	Implication for ELT
The most frequent 2,000 word families cover ~80% of general text	Prioritise high-frequency vocabulary
Many "grammar rules" are probabilistic tendencies, not absolute rules	Teach patterns and typical use, not just rules
Spoken and written English differ systematically	Use appropriate models for each skill
Native-speaker intuitions about frequency and use are often wrong	Base materials on corpus evidence, not intuition
Language is highly formulaic; much of speech and writing consists of recurrent chunks	Support the Lexical Approach and Formulaic Language teaching
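
The coverage figure in the first row of the table is the kind of number that can be computed directly from a frequency list. The sketch below is a simplified illustration: it counts word forms rather than word families (so it will understate coverage compared with family-based counts), and the `coverage` function and sample sentence are invented for the example.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Share of running words accounted for by the top_n most frequent word forms."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


# Invented toy text; a real calculation would use a corpus of millions of words.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for n in (1, 3, 8):
    print(n, round(coverage(tokens, n), 2))
```

Run on a large general corpus with a word-family list, the same calculation produces the cumulative coverage curves that underpin frequency-based vocabulary syllabuses.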

Key Corpora

Corpus	Size	What it covers
British National Corpus (BNC)	100M words	British English, 1990s, spoken + written
Corpus of Contemporary American English (COCA)	1B+ words	American English, 1990–present, multiple genres
Cambridge English Corpus	2B+ words	Learner + native, informs Cambridge materials
CANCODE	5M words	Cambridge–Nottingham Corpus of Discourse in English; spoken, informs the Touchstone series and Cambridge Grammar of English
Michigan Corpus of Academic Spoken English (MICASE)	1.8M words	Academic speech
International Corpus of Learner English (ICLE)	3.7M words	L2 English from multiple L1 backgrounds

Applications in Language Teaching

Materials Development

Corpus data helps textbooks present language as it is actually used. Frequency-based vocabulary lists (the Academic Word List, the General Service List) are corpus products. Carter and McCarthy's CANCODE work fed the Cambridge Touchstone series, where dialogue lines were checked against spoken-corpus frequencies before publication.

Pedagogic Corpus

Distinct from corpus-informed materials writing, the pedagogic corpus is the cumulative body of texts a course exposes a learner to. Course design becomes corpus engineering: choosing and ordering texts so that high-frequency items recur often enough across the syllabus for patterns to emerge. Reference corpora (BNC, COCA, CANCODE) inform the writer's pen; the pedagogic corpus is what the learner actually meets.
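
One way to picture "corpus engineering" is as a recurrence check: for each target item, in which units of the course does the learner meet it? The following is a rough sketch under stated assumptions. The unit texts, the target items and the threshold of two units are all invented, and matching is on exact word forms only (no lemmatisation).

```python
import re


def recurrence(units, targets):
    """Map each target item to the 1-based unit numbers in which it occurs."""
    table = {target: [] for target in targets}
    for number, text in enumerate(units, start=1):
        words = set(re.findall(r"[a-z']+", text.lower()))
        for target in targets:
            if target in words:
                table[target].append(number)
    return table


def under_recycled(table, min_units):
    """Targets that appear in fewer than min_units units."""
    return sorted(t for t, units in table.items() if len(units) < min_units)


# Hypothetical course: one short text per unit.
units = [
    "We depend on the data.",
    "Results depend on the sample.",
    "The sample was small.",
    "They rely on intuition.",
]
table = recurrence(units, ["depend", "sample", "rely"])

print(table)
print(under_recycled(table, min_units=2))
```

A course writer could use output like this to decide where an item needs another encounter later in the syllabus.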

Data-Driven Learning (DDL)

Pioneered by Tim Johns (1991), DDL gives learners direct access to Concordance Lines and asks them to discover patterns inductively, a form of Guided Discovery. Learners examine authentic examples and formulate rules, developing analytical skills alongside language knowledge.
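
A DDL task typically starts from concordance lines sorted so that a pattern becomes visible, for example by the word to the right of the search term. The sketch below is a minimal illustration using only the standard library; the function names, the 30-character context width and the sample sentences are assumptions made for the example, not features of any specific concordancer.

```python
import re


def kwic(text, node, width=30):
    """Return (left context, match, right context) for every hit of node."""
    lines = []
    pattern = rf"\b{re.escape(node)}\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


def sort_by_right(lines):
    """Order concordance lines alphabetically by what follows the node."""
    return sorted(lines, key=lambda line: line[2].strip().lower())


def render(lines, width=30):
    """Align the node word in a central column, KWIC style."""
    return [f"{left:>{width}} {node} {right:<{width}}" for left, node, right in lines]


# Invented sentences standing in for authentic corpus text.
text = (
    "She is interested in music. They depend on the weather. "
    "He was interested in the result. We depend on each other. "
    "I am interested to hear more."
)

for line in render(sort_by_right(kwic(text, "interested"))):
    print(line)
```

With the lines sorted this way, learners can be asked what tends to follow "interested" and to formulate the pattern themselves.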

Grammar Description

Corpus-informed grammars (e.g., Biber et al., 1999, Longman Grammar of Spoken and Written English) describe frequency, register variation, and typical patterns. This is information that traditional grammars generally do not provide.

Learner Corpus Research

Analysing corpora of learner writing reveals systematic errors, overuse, underuse, and developmental patterns specific to L1 groups. This informs targeted teaching and assessment.
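
Overuse and underuse are usually established by comparing an item's frequency in a learner corpus with its frequency in a reference corpus. The sketch below assumes the two-corpus log-likelihood statistic widely used for such comparisons, together with frequencies normalised per million words. The counts and corpus sizes are invented for illustration, and the 3.84 cut-off is the conventional chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math


def per_million(freq, size):
    """Normalise a raw count to occurrences per million words."""
    return freq * 1_000_000 / size


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one item (higher = bigger difference)."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def compare(learner_freq, learner_size, ref_freq, ref_size, threshold=3.84):
    """Label an item as overused, underused, or not clearly different."""
    ll = log_likelihood(learner_freq, learner_size, ref_freq, ref_size)
    if ll < threshold:
        return "no clear difference"
    learner_rate = per_million(learner_freq, learner_size)
    ref_rate = per_million(ref_freq, ref_size)
    return "overuse" if learner_rate > ref_rate else "underuse"


# Hypothetical counts: an item in a 1,000-word learner sample
# versus a 10,000-word reference sample.
print(compare(50, 1_000, 100, 10_000))
print(compare(10, 1_000, 100, 10_000))
```

A significant value only shows that the frequencies differ; deciding whether the difference reflects L1 influence, task effects, or topic still requires reading the concordance lines.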

Limitations

Corpora show what people say, not what is possible or desirable; the evidence is descriptive, not prescriptive
Representativeness is always a design decision; no corpus perfectly mirrors "the language"
Quantitative patterns can obscure qualitative differences
DDL requires training and is not equally effective with all learner populations
