Keyword Analysis
Keyword analysis is a corpus linguistics technique that identifies words occurring with unusual frequency in a target corpus compared to a reference corpus. These statistically significant items, known as "keywords", reveal what a text or collection of texts is distinctively about, beyond what general English frequency would predict.
How It Works
- Compile a target corpus — The texts under investigation (e.g., a set of medical research articles)
- Select a reference corpus — A larger, general corpus representing "normal" English (e.g., the BNC or COCA)
- Compare frequencies — For each word, compare its frequency in the target corpus against its expected frequency based on the reference corpus
- Apply a statistical test — Words with statistically significant over- or under-representation are flagged as keywords
The result is a ranked list of items that characterise the target corpus — its aboutness and its distinctive linguistic features.
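The first three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular tool: the tokenizer is deliberately naive, the function names (`tokenize`, `frequency_profile`, `compare_frequencies`) are hypothetical, and the toy texts are invented. The statistical test from the final step is left out here.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_profile(texts):
    """Return (word counts, total token count) for a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts, sum(counts.values())

def compare_frequencies(target_texts, reference_texts):
    """Pair each target word's observed count with the count expected if the
    word were as frequent in the target as it is in the reference."""
    target, n_target = frequency_profile(target_texts)
    reference, n_reference = frequency_profile(reference_texts)
    rows = []
    for word, observed in target.items():
        expected = reference[word] / n_reference * n_target
        rows.append((word, observed, expected))
    # Largest gap between observed and expected counts first
    return sorted(rows, key=lambda r: r[1] - r[2], reverse=True)

# Toy data, invented purely for illustration
target_texts = ["The patient received a diagnosis.", "The patient recovered."]
reference_texts = ["The weather was fine.", "The dog chased the cat.", "A patient man waited."]

for word, observed, expected in compare_frequencies(target_texts, reference_texts)[:3]:
    print(f"{word:10s} observed={observed} expected={expected:.2f}")
```

Sorting by the raw gap between observed and expected counts is only a placeholder for ranking; in practice a keyness statistic takes its place.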
Statistical Measures
| Measure | Description |
|---|---|
| Log-likelihood (Dunning 1993) | The most widely used keyness statistic; tests the significance of frequency differences; more reliable than chi-squared with low-frequency items |
| Chi-squared | Earlier measure; less reliable with sparse data |
| Odds ratio | Measures effect size rather than significance; how many times greater the odds of a word occurring are in the target than in the reference |
| Kullback-Leibler divergence | Information-theoretic approach; treats keyness as effect size (Gries 2021) |
Mike Scott's WordSmith Tools (1996 onwards) popularised keyword analysis and remains a standard tool. Scott defined a keyword simply as a word occurring with "unusual frequency" in comparison to a reference corpus.
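To make the difference between a significance measure and an effect-size measure concrete, the sketch below computes log-likelihood and an odds ratio for a single word. It assumes the two-term form of log-likelihood that is common in keyness work (only the word's own counts in each corpus contribute), and the 0.5 smoothing in the odds ratio is an assumed choice to avoid division by zero. The counts are invented.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness (two-term form).
    a = word count in target, b = word count in reference,
    c = target corpus size, d = reference corpus size (in tokens)."""
    e1 = c * (a + b) / (c + d)  # expected count in the target
    e2 = d * (a + b) / (c + d)  # expected count in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def odds_ratio(a, b, c, d, smoothing=0.5):
    """Odds of the word in the target divided by its odds in the reference.
    The smoothing constant is an assumption, added to every cell so that
    zero counts do not cause a division by zero."""
    target_odds = (a + smoothing) / (c - a + smoothing)
    reference_odds = (b + smoothing) / (d - b + smoothing)
    return target_odds / reference_odds

# Invented counts: 120 hits in a 100,000-token target,
# 300 hits in a 1,000,000-token reference
ll = log_likelihood(120, 300, 100_000, 1_000_000)
ratio = odds_ratio(120, 300, 100_000, 1_000_000)
print(f"log-likelihood = {ll:.2f}, odds ratio = {ratio:.2f}")
```

A log-likelihood above 3.84 corresponds to p < 0.05 at one degree of freedom. The odds ratio answers a different question: how large the difference is, regardless of how much data supports it.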
Positive and Negative Keywords
- Positive keywords — Words significantly more frequent in the target corpus (e.g., patient, diagnosis in medical texts)
- Negative keywords — Words significantly less frequent than expected (e.g., informal vocabulary in legal texts)
Both types are informative: negative keywords reveal what a genre avoids as much as what it foregrounds.
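One way to separate the two types in code is to give the keyness score a sign: positive when the word is relatively more frequent in the target, negative when it is relatively less frequent. The sketch below assumes a two-term log-likelihood score and a cut-off of 3.84 (p < 0.05 at one degree of freedom); many studies use stricter thresholds. The function names and all counts are hypothetical.

```python
import math

def signed_keyness(a, b, c, d):
    """Two-term log-likelihood with a sign attached.
    a, b = word counts in target and reference; c, d = corpus sizes in tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target
    e2 = d * (a + b) / (c + d)  # expected count in the reference
    ll = 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)
    # Positive if the relative frequency is at least as high in the target
    return ll if a / c >= b / d else -ll

def classify(a, b, c, d, threshold=3.84):
    """Label a word as a positive keyword, a negative keyword, or not key."""
    score = signed_keyness(a, b, c, d)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "not key"

# Invented counts for a medical target corpus: (count in target, count in reference)
counts = {"patient": (150, 200), "gonna": (0, 400), "the": (6000, 60000)}
target_size, reference_size = 100_000, 1_000_000

labels = {word: classify(a, b, target_size, reference_size)
          for word, (a, b) in counts.items()}
print(labels)
```

In this toy data the function word has the same relative frequency in both corpora, so it is not flagged in either direction.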
Applications in ELT
| Application | Example |
|---|---|
| Genre analysis | Identifying characteristic vocabulary of academic, journalistic, or legal genres |
| ESP/EAP materials | Extracting domain-specific vocabulary for specialised courses |
| Textbook evaluation | Checking whether a coursebook's vocabulary matches the target register |
| Learner corpus research | Comparing learner output against native speaker corpora to identify overuse/underuse |
| Test development | Ensuring reading passages contain vocabulary appropriate to the target level |
Limitations
Keyword analysis identifies what is distinctive but not why. Interpretation requires qualitative analysis of concordance lines. Additionally, results are heavily influenced by the choice of reference corpus — different references produce different keyword lists.
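The dependence on the reference corpus is easy to demonstrate with toy numbers. The sketch below ranks the same target word list against two hypothetical references that differ only in how often "study" occurs. As a simple stand-in for a keyness statistic, it uses a ratio of relative frequencies with add-one smoothing; both the measure and all counts are assumptions made for illustration.

```python
from collections import Counter

def top_keywords(target, reference, n=2):
    """Rank target words by a smoothed ratio of relative frequencies
    (target relative frequency divided by reference relative frequency)."""
    n_target = sum(target.values())
    n_reference = sum(reference.values())

    def ratio(word):
        # Add-one smoothing keeps the ratio defined for words absent from the reference
        return ((target[word] + 1) / n_target) / ((reference[word] + 1) / n_reference)

    return sorted(target, key=ratio, reverse=True)[:n]

# Invented counts for a small medical research corpus
target = Counter({"patient": 40, "study": 60, "treatment": 20, "the": 880})

# Two hypothetical references: "study" is rare in the general one
# and common in the academic one
general_ref = Counter({"patient": 10, "study": 5, "treatment": 10, "the": 9000})
academic_ref = Counter({"patient": 10, "study": 600, "treatment": 10, "the": 9000})

print(top_keywords(target, general_ref))
print(top_keywords(target, academic_ref))
```

Against the general reference, "study" tops the list; against the academic reference it drops out, because it is no longer unusual there.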