ELTiverse

Corpus Linguistics

Language Analysis | Methodology

Corpus linguistics is the study of language through the systematic analysis of large, electronically stored collections of authentic text (corpora). It is fundamentally a methodology rather than a theory — an empirical approach that has transformed how we describe English and design language teaching materials.

Core Concepts

  • Corpus (pl. corpora): a principled collection of naturally occurring text, designed to be representative of a language variety. Can be general (e.g., the British National Corpus, 100 million words) or specialised (e.g., a corpus of academic journal articles)
  • Concordance: a display of every occurrence of a search term in its context, typically in KWIC (Key Word In Context) format — see Concordance Lines
  • Frequency list: words ranked by how often they occur — the basis for vocabulary selection in teaching
  • Collocation: statistically significant co-occurrence of words, revealed through corpus analysis
  • N-gram / cluster: recurring multi-word sequences (on the other hand, in terms of, as a result of)
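The concepts above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular concordancer works: the sample text is invented, the tokenizer is deliberately crude, and the function names (tokenize, frequency_list, kwic, ngrams, pmi) are hypothetical. Pointwise mutual information (PMI) stands in here for collocation strength; it is only one of several association measures, and corpus tools often offer others such as log-likelihood or t-score.

```python
import math
import re
from collections import Counter

# Toy text invented for illustration; a real corpus would be far larger.
TEXT = (
    "On the other hand, the results were strong. "
    "On the other hand, the sample was small. "
    "As a result of the small sample, the results were not strong evidence."
)


def tokenize(text):
    """Lowercase the text and pull out word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs ranked from most to least frequent."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, window=3):
    """Return one KWIC line per hit, with `window` words of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines


def ngrams(tokens, n):
    """Count every contiguous n-word sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pmi(tokens, w1, w2):
    """PMI (log base 2) for the adjacent pair (w1, w2); higher = stronger association."""
    unigrams = Counter(tokens)
    pair_count = ngrams(tokens, 2)[(w1, w2)]
    if pair_count == 0:
        return float("-inf")
    n = len(tokens)
    p_pair = pair_count / (n - 1)          # n - 1 adjacent pairs in n tokens
    p_w1, p_w2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_pair / (p_w1 * p_w2))


tokens = tokenize(TEXT)
top_words = frequency_list(tokens)[:3]
hand_lines = kwic(tokens, "hand")
four_grams = ngrams(tokens, 4)

for line in hand_lines:
    print(line)
```

On this toy text, the sequence "on the other hand" recurs as a 4-gram, and the pair "other hand" scores a higher PMI than "the results" because "the" also occurs freely with many other words. With so little data the numbers mean nothing in themselves; the point is only the shape of each computation.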

What Corpus Linguistics Reveals

Corpus analysis has overturned many intuition-based assumptions about English:

Finding | Implication for ELT
The most frequent 2,000 word families cover ~80% of general text | Prioritise high-frequency vocabulary
Many "grammar rules" are probabilistic tendencies, not absolute rules | Teach patterns and typical use, not just rules
Spoken and written English differ systematically | Use appropriate models for each skill
Native-speaker intuitions about frequency and use are often wrong | Base materials on corpus evidence, not intuition
Language is highly formulaic: much of speech and writing consists of recurrent chunks | Support the Lexical Approach and Formulaic Language teaching
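A coverage figure like the one in the first row is the share of running words in a text that belong to the top-N items of a frequency list. The sketch below shows that calculation under simplifying assumptions: it counts individual word forms rather than word families (grouping into families needs a separate family list, not shown), the sentences are invented, and the function name coverage is hypothetical.

```python
from collections import Counter


def coverage(reference_tokens, target_tokens, top_n):
    """Share of target tokens found among the top_n most frequent reference words."""
    top = {word for word, _ in Counter(reference_tokens).most_common(top_n)}
    if not target_tokens:
        return 0.0
    covered = sum(1 for tok in target_tokens if tok in top)
    return covered / len(target_tokens)


# Toy data invented for illustration only.
reference = "the cat sat on the mat and the dog sat on the rug".split()
target = "the dog sat on the mat".split()

share = coverage(reference, target, top_n=3)
print(f"Top-3 words cover {share:.0%} of the target text")
```

Here the three most frequent reference words ("the", "sat", "on") account for four of the six tokens in the target sentence. The same calculation, run with a 2,000-family list over a large general corpus, is what produces figures of the kind cited in the table.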

Key Corpora

Corpus | Size | What it covers
British National Corpus (BNC) | 100M words | British English, 1990s, spoken + written
Corpus of Contemporary American English (COCA) | 1B+ words | American English, 1990–present, multiple genres
Cambridge English Corpus | 2B+ words | Learner + native, informs Cambridge materials
Michigan Corpus of Academic Spoken English (MICASE) | 1.8M words | Academic speech
International Corpus of Learner English (ICLE) | 3.7M words | L2 English from multiple L1 backgrounds

Applications in Language Teaching

Materials Development

Corpus data helps ensure that textbooks present language as it is actually used. Frequency-based vocabulary lists such as the Academic Word List and the General Service List are themselves products of corpus analysis.

Data-Driven Learning (DDL)

Pioneered by Tim Johns (1991), DDL gives learners direct access to Concordance Lines and asks them to discover patterns inductively — a form of Guided Discovery. Learners examine authentic examples and formulate rules, developing analytical skills alongside language knowledge.

Grammar Description

Corpus-informed grammars (e.g., Biber et al., 1999, Longman Grammar of Spoken and Written English) describe frequency, register variation, and typical patterns, information that traditional grammars rarely provide.

Learner Corpus Research

Analysing corpora of learner writing reveals systematic errors, overuse, underuse, and developmental patterns specific to L1 groups. This informs targeted teaching and assessment.
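Overuse and underuse are usually judged by comparing relative frequencies, since learner and reference corpora differ in size. The sketch below normalises counts per million words and takes a base-2 log ratio. All counts are hypothetical, the function names are illustrative, and adding 0.5 to each count is an assumption made here to avoid division by zero. A real study would also apply a significance test, which is not shown.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000


def log_ratio(learner_count, learner_size, ref_count, ref_size):
    """log2 of learner vs reference relative frequency.

    Positive values suggest overuse by learners, negative values underuse.
    0.5 is added to each count (an assumption) so zero counts do not break the maths.
    """
    learner_rate = (learner_count + 0.5) / learner_size
    ref_rate = (ref_count + 0.5) / ref_size
    return math.log2(learner_rate / ref_rate)


# Hypothetical counts for illustration only; not taken from any real corpus.
LEARNER_SIZE = 200_000
REFERENCE_SIZE = 1_000_000

overused = log_ratio(120, LEARNER_SIZE, 300, REFERENCE_SIZE)
underused = log_ratio(10, LEARNER_SIZE, 400, REFERENCE_SIZE)

print(f"learner rate: {per_million(120, LEARNER_SIZE):.0f} per million")
print(f"log ratio (overused item): {overused:.2f}")
print(f"log ratio (underused item): {underused:.2f}")
```

A log ratio near 1 means the item is roughly twice as frequent in the learner data as in the reference data; a value near -1 means roughly half as frequent.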

Limitations

  • Corpora show what people say, not what is possible or desirable — descriptive, not prescriptive
  • Representativeness is always a design decision — no corpus perfectly mirrors "the language"
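  • Quantitative patterns can obscure qualitative differences: a raw frequency count, for example, does not separate the different senses or functions of the same word form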
  • DDL requires training and is not equally effective with all learner populations

Related Terms