Keyword Analysis


Keyword analysis is a corpus linguistics technique that identifies words occurring with unusual frequency in a target corpus compared to a reference corpus. These statistically significant items, called keywords, reveal what a text or collection of texts is distinctively about, beyond what word frequencies in general English would predict.

How It Works

1. Compile a target corpus: The texts under investigation (e.g., a set of medical research articles)
2. Select a reference corpus: A larger, general corpus representing "normal" English (e.g., the BNC or COCA)
3. Compare frequencies: For each word, compare its frequency in the target corpus against its expected frequency based on the reference corpus
4. Apply a statistical test: Words with statistically significant over- or under-representation are flagged as keywords

The result is a ranked list of items that characterise the target corpus: its aboutness and its distinctive linguistic features.
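
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated tool: it assumes both corpora are already tokenised into lists of words, it uses the two-term form of the log-likelihood statistic, and the names log_likelihood and keywords and the default threshold of 3.84 (the chi-squared critical value for p < 0.05 with one degree of freedom) are choices made here for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) keyness score for a single word.

    a, b: frequency of the word in the target / reference corpus
    c, d: total number of tokens in the target / reference corpus
    """
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / expected_target)
    if b > 0:
        g2 += b * math.log(b / expected_reference)
    return 2 * g2


def keywords(target_tokens, reference_tokens, threshold=3.84):
    """Return (word, target_freq, reference_freq, G2) tuples, ranked by G2."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    results = []
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]
        score = log_likelihood(a, b, c, d)
        if score >= threshold:
            results.append((word, a, b, score))
    return sorted(results, key=lambda row: row[3], reverse=True)


# Toy data standing in for real corpora (hypothetical counts).
target_corpus = ["patient"] * 20 + ["the"] * 80
reference_corpus = ["patient"] * 5 + ["the"] * 95
for word, a, b, score in keywords(target_corpus, reference_corpus):
    print(f"{word}\t{a}\t{b}\t{score:.2f}")
```

With real data the threshold is usually set much higher, because large corpora make even small frequency differences statistically significant.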

Statistical Measures

Measure	Description
Log-likelihood (Dunning 1993)	The most widely used keyness statistic; tests the significance of frequency differences; more reliable than chi-squared with low-frequency items
Chi-squared	Earlier measure; less reliable with sparse data
Odds ratio	Measures effect size rather than significance; how much more likely a word is in the target
Kullback-Leibler Divergence	Information-theoretic approach; treats keyness as effect size (Gries 2021)
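
To illustrate the difference between a significance test and an effect size, the sketch below computes an odds ratio from the same four counts a keyness test uses. The function name odds_ratio and the 0.5 added to each cell (a common continuity correction that avoids division by zero) are assumptions made for this example, not part of any particular tool.

```python
def odds_ratio(a, b, c, d, correction=0.5):
    """Odds of the word in the target divided by its odds in the reference.

    a, b: frequency of the word in the target / reference corpus
    c, d: total number of tokens in the target / reference corpus
    correction: added to every cell so that zero counts do not
                cause a division by zero (an assumption of this sketch)
    """
    target_odds = (a + correction) / (c - a + correction)
    reference_odds = (b + correction) / (d - b + correction)
    return target_odds / reference_odds


# Same proportions, corpora ten times larger: the effect size is unchanged.
small = odds_ratio(20, 5, 100, 100, correction=0)
large = odds_ratio(200, 50, 1000, 1000, correction=0)
print(small, large)
```

A value above 1 means the word is more likely in the target corpus, a value below 1 means it is less likely. Unlike a significance statistic, the value does not grow just because the corpora are larger.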

Mike Scott's WordSmith Tools (1996 onwards) popularised keyword analysis and remains a standard tool. Scott defined a keyword simply as a word occurring with "unusual frequency" in comparison to a reference corpus.

Positive and Negative Keywords

Positive keywords: Words significantly more frequent in the target corpus (e.g., patient, diagnosis in medical texts)
Negative keywords: Words significantly less frequent than expected (e.g., informal vocabulary in legal texts)

Both types are informative: negative keywords reveal what a genre avoids as much as what it foregrounds.
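
Statistics such as log-likelihood are never negative, so the direction of a keyword has to be read from the relative frequencies. The sketch below assumes the words passed in have already been flagged as significant; the function name split_keywords and the shape of the counts dictionary are hypothetical.

```python
def split_keywords(counts, target_size, reference_size):
    """Split significant words into positive and negative keywords.

    counts: dict mapping word -> (target_freq, reference_freq)
    target_size, reference_size: total tokens in each corpus
    """
    positive, negative = [], []
    for word, (a, b) in counts.items():
        target_rate = a / target_size
        reference_rate = b / reference_size
        if target_rate > reference_rate:
            positive.append(word)  # over-represented in the target
        elif target_rate < reference_rate:
            negative.append(word)  # under-represented in the target
    return positive, negative


# Hypothetical counts for illustration only.
counts = {"patient": (20, 5), "gonna": (0, 12)}
print(split_keywords(counts, target_size=100, reference_size=200))
```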

Applications in ELT

Application	Example
Genre analysis	Identifying characteristic vocabulary of academic, journalistic, or legal genres
ESP/EAP materials	Extracting domain-specific vocabulary for specialised courses
Textbook evaluation	Checking whether a coursebook's vocabulary matches the target register
Learner corpus research	Comparing learner output against native speaker corpora to identify overuse/underuse
Test development	Ensuring reading passages contain vocabulary appropriate to the target level

Limitations

Keyword analysis identifies what is distinctive but not why. Interpretation requires qualitative analysis of concordance lines. Results also depend heavily on the choice of reference corpus: different reference corpora produce different keyword lists.
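
A concordance shows each occurrence of a keyword with its surrounding context, which is what the qualitative step works from. The following is a minimal keyword-in-context sketch, assuming a pre-tokenised corpus; the function name concordance and the width parameter are illustrative choices.

```python
def concordance(tokens, keyword, width=4):
    """Return (left context, match, right context) for each hit.

    tokens: list of word tokens
    keyword: word to look up (matched case-insensitively)
    width: number of tokens of context kept on each side
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text for illustration only.
sample = "The patient was stable . Each patient received a diagnosis".split()
for left, match, right in concordance(sample, "patient", width=2):
    print(f"{left:>12}  [{match}]  {right}")
```

Reading the lines side by side shows how a keyword is actually used, which the frequency comparison alone cannot reveal.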

Related Terms