Graded Reader AI Pipeline
How to use NLP tools and LLMs to produce publisher-quality graded reader content: the tools, the failure modes, and a realistic Python workflow.
The Problem Space
An LLM like GPT-4 or Claude can write fluent, engaging narrative. The challenge is that LLMs do not naturally respect headword lists. Left unconstrained, a model writing a "Stage 2 reader" at 700 headwords will routinely use words from the 3,000–5,000 frequency band, including words like "hesitate", "deliberately", and "momentum", because these feel natural to a fluent writer trained on the entire internet.
The gap between what LLMs do naturally and what graded readers require is both a vocabulary problem and a grammar problem. Solving it requires either constraining the model during generation or auditing and rewriting after the fact, or both.
NLP Tools for Vocabulary Profiling
These are the key instruments for analysing text before and after generation.
AntWordProfiler (Laurence Anthony)
Free desktop software from Laurence Anthony (Waseda University). Takes a text and a word list (e.g., Nation's BNC/COCA 1K, 2K, 3K, 4K lists or the NGSL-GR), then outputs:
- What percentage of tokens fall within each frequency band
- Which specific tokens fall outside the target list (flagged as off-list)
- Type-token breakdown by band
The workhorse tool for graded reader vocabulary QA. Pairs naturally with Nation's lists or the NGSL-GR. Limitation: desktop only, no API, no Python integration. For pipeline automation you need to replicate its logic programmatically.
- Download: Laurence Anthony's site at Waseda University
VocabProfile / Lextutor (Tom Cobb)
Web-based vocabulary profiler at lextutor.ca. The classic version profiles against the General Service List (K1, K2) and the Academic Word List; later versions add Nation's BNC and BNC/COCA lists. Older but widely cited in research. Outputs percentage of text at each frequency band (K1, K2, AWL, off-list). Useful for cross-checking AntWordProfiler results. No Python API; results must be scraped or used manually.
NGSL Profiler (New General Service List Project)
Online tool specifically designed for graded reader authors. Profiles text against the NGSL-GR's 11 bands. Highlights off-band vocabulary. Can suggest simpler alternatives. The NGSL-GR is the most purpose-built list for graded reader production, derived from a 273-million-word corpus.
- Available at: newgeneralservicelist.com/ngsl-profiler-new
TextInspector (textinspector.com)
Web-based tool that won the 2017 ELTons Digital Innovation award. Integrates with the English Vocabulary Profile (EVP), Cambridge University Press's CEFR-mapped vocabulary database that specifies at which CEFR level each word/phrase/collocation is typically learned.
Outputs:
- Word-by-word CEFR level tags (A1–C2)
- Percentage of text at each CEFR level
- Flagging of above-level vocabulary
- Integration with BNC/COCA frequency data and the Academic Word List
The EVP integration makes TextInspector the most CEFR-precise tool for graded reader QA. Paid subscription for full features; free tier available.
Python Libraries
For programmatic pipeline integration:
| Library | What it does |
|---|---|
| spacy | Tokenisation, lemmatisation, POS tagging: foundation for all profiling |
| spacy-readability | Adds Flesch-Kincaid, Flesch Reading Ease, Dale-Chall, SMOG to spaCy Doc objects |
| textdescriptives | Comprehensive text metrics via spaCy pipe: readability, coherence, POS distribution, dependency distance |
| lexicalrichness | Type-token ratio (TTR), MATTR, MTLD, vocd: lexical diversity metrics |
| taaled | Advanced lexical diversity (MATTR, MTLD): more robust than raw TTR for texts of different lengths |
| nltk | Core text processing; useful for sentence segmentation and frequency counts |
| pandas / polars | Storing and filtering flagged words |
A working vocabulary profiler in Python requires only spacy + a frequency-band word list in dictionary/set form. The logic is: lemmatise each token, look up in band dictionary, report.
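The lemmatise, look up, report logic can be sketched in plain Python. This is a minimal illustration under stated assumptions: the `BANDS` dictionary and the small `LEMMAS` map are invented stand-ins for a real frequency-band list and for spaCy's lemmatiser, and the regex tokeniser is a simplification.

```python
import re
from collections import Counter

# Hypothetical band dictionary: lemma -> band label. A real one would be
# loaded from a published list such as the BNC/COCA or NGSL-GR files.
BANDS = {"the": "K1", "girl": "K1", "be": "K1", "very": "K1",
         "tired": "K1", "she": "K1", "walk": "K1", "home": "K1",
         "hesitate": "K3"}

# Toy lemma map standing in for spaCy's lemmatiser (token.lemma_).
LEMMAS = {"was": "be", "walked": "walk", "hesitated": "hesitate"}

def band_profile(text: str) -> Counter:
    """Count tokens per frequency band; unknown lemmas count as off-list."""
    words = re.findall(r"[a-z']+", text.lower())
    lemmas = [LEMMAS.get(w, w) for w in words]
    return Counter(BANDS.get(lemma, "off-list") for lemma in lemmas)

profile = band_profile("The girl was very tired. She hesitated, then walked home.")
for band, count in sorted(profile.items()):
    print(band, count)
```

In a real pipeline the two dictionaries would be replaced by the chosen headword list and by `token.lemma_` from a spaCy `Doc`; the counting step stays the same.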
Key NLP Metrics to Track
These are the metrics that matter for graded reader QA. Track them per passage, not just per full text.
Vocabulary Coverage Metrics
- Headword coverage %: what percentage of lemmatised tokens fall within the target headword list. Target: ≥98%.
- Off-list token count: raw number of tokens outside the target list (easier to act on than a percentage).
- Off-list type count: number of unique off-list words (types, not tokens). A word that appears 10 times is 10 tokens but 1 type. Type count tells you how many substitutions you need to make.
- Frequency band distribution: % of text in K1/K2/K3/K4 (Nation) or NGSL bands. Should be front-loaded in lower bands.
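The difference between off-list tokens and off-list types is easiest to see with numbers. The sketch below works on an already lemmatised list; the function name `coverage_metrics` and the sample data are invented for illustration.

```python
from collections import Counter

def coverage_metrics(lemmas: list[str], target_list: set[str]) -> dict:
    """Headword coverage plus off-list token and type counts."""
    off_list = [lemma for lemma in lemmas if lemma not in target_list]
    total = len(lemmas)
    return {
        "coverage_pct": 100 * (total - len(off_list)) / total if total else 0.0,
        "off_list_tokens": len(off_list),          # every occurrence counts
        "off_list_types": len(set(off_list)),      # each distinct word counts once
        "most_frequent_off_list": Counter(off_list).most_common(3),
    }

# Invented passage: 200 lemmas, "shiver" occurs 3 times and "momentum" once.
lemmas = ["walk"] * 196 + ["shiver"] * 3 + ["momentum"]
metrics = coverage_metrics(lemmas, target_list={"walk"})
print(metrics)
```

Here four off-list tokens reduce to two off-list types, so two substitutions (or two glossary entries) would bring the passage to full coverage.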
Lexical Diversity Metrics
- Type-Token Ratio (TTR): unique words / total words. Raw TTR is unreliable across different text lengths; use MATTR (Moving Average TTR) instead, which calculates TTR over a sliding window of fixed size (typically 100 words).
- MTLD (Measure of Textual Lexical Diversity): one of the more stable diversity measures, far less sensitive to text length than raw TTR. Graded readers intentionally have lower MTLD than authentic texts because variety is sacrificed for controlled repetition.
- VPF profile (Vocabulary Profile: Frequency bands): the shape of the frequency distribution. A well-graded text at Stage 2 should have >70% of tokens in K1, <25% in K2, <5% beyond K2.
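Why raw TTR misleads across text lengths, and how a moving window corrects for it, can be shown with a small plain-Python sketch. The functions below are simplified illustrations, not the implementations used by lexicalrichness or taaled, and the repeated-token test data is artificial.

```python
def ttr(tokens: list[str]) -> float:
    """Raw type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mattr(tokens: list[str], window: int = 100) -> float:
    """Mean TTR over every run of `window` consecutive tokens."""
    if len(tokens) <= window:
        return ttr(tokens)  # shorter than one window: fall back to raw TTR
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

short_text = ["a", "b", "c", "d"] * 5   # 20 tokens, 4 types
long_text = short_text * 10             # 200 tokens, the same 4 types

print(ttr(short_text), ttr(long_text))              # raw TTR falls with length
print(mattr(short_text, 10), mattr(long_text, 10))  # MATTR stays level
```

The two texts have identical vocabulary behaviour, yet raw TTR drops tenfold for the longer one while MATTR is unchanged, which is the reason to prefer it when comparing passages of different lengths.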
Syntactic Metrics
- Mean sentence length (MSL): total words / total sentences. See Graded Reader Construction for level norms.
- Mean dependency distance (MDD): the average syntactic distance between words in dependency parses. Lower MDD = simpler syntax. Computable with spaCy. Research (Liu, 2008) shows MDD is a more sensitive measure of syntactic complexity than clause count or sentence length alone.
- Proportion of subordinate clauses: POS + dependency tagging can flag subordinate clause markers (because, although, which, that, when used as subordinators). Should be low at Starter/Stage 1.
- Passive voice rate: spaCy dependency labels can identify passive constructions (nsubjpass or auxpass). Oxford Bookworms limits passives to Stage 2+.
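Mean dependency distance is simple to compute once a parse is available. The sketch below takes a list of head indices, following spaCy's convention that the root token is its own head; the two parses are hand-annotated for illustration rather than produced by a parser.

```python
def mean_dependency_distance(heads: list[int]) -> float:
    """heads[i] is the index of token i's head; the root points to itself.
    MDD is the mean absolute distance between each non-root token and its head."""
    distances = [abs(i - h) for i, h in enumerate(heads) if i != h]
    return sum(distances) / len(distances) if distances else 0.0

# "The dog barked": The -> dog, dog -> barked, barked is the root.
simple = [1, 2, 2]

# "The dog that I saw barked": The -> dog, dog -> barked, that -> saw,
# I -> saw, saw -> dog, barked is the root.
with_relative_clause = [1, 5, 4, 4, 1, 5]

print(mean_dependency_distance(simple))
print(mean_dependency_distance(with_relative_clause))
```

With a spaCy `Doc`, the same list could be built per sentence as `[t.head.i - sent.start for t in sent]`; the relative clause more than doubles the distance in this example.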
Readability Scores
- Flesch-Kincaid Grade Level: fast and widely understood, but L1-calibrated. Use as a relative guide, not an absolute target.
- Flesch Reading Ease: target 70–85 for graded readers at lower levels.
- SMOG Index: uses polysyllabic word count. More robust than FK for medical/academic texts but less common in ELT contexts.
- Lexile score: useful if the material will be assessed against US standards; less relevant for pure EFL.
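The two Flesch formulas are short enough to compute directly, which helps when interpreting what the libraries report. The sketch below takes raw counts as inputs; the counts in the example are invented, and syllable counting (the error-prone part) is deliberately left to whichever library is used.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher is easier; standard Flesch Reading Ease formula."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate US school grade; standard Flesch-Kincaid formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Invented passage counts: 100 words, 10 sentences, 135 syllables.
ease = flesch_reading_ease(100, 10, 135)
grade = flesch_kincaid_grade(100, 10, 135)
print(round(ease, 1), round(grade, 1))
```

Both scores depend only on sentence length and syllables per word, so they say nothing about whether a word is on the headword list; they complement vocabulary profiling rather than replace it.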
LLM Failure Modes at Vocabulary Control
Understanding why LLMs fail at this task is essential for designing the mitigation strategy.
1. Vocabulary Leakage
LLMs select words by token probability, not frequency-band membership. A model trained on web text will instinctively choose "exhausted" over "very tired" because it is stylistically richer, but "exhausted" sits at K3 frequency, above a 700-headword limit. This is the primary failure mode. It is not random; it is systematic: models consistently over-use mid-frequency expressive vocabulary.
2. Grammar Complexity Drift
Even when instructed to use simple grammar, LLMs drift toward relative clauses, complex nominal groups, and embedded conditionals in longer passages. The model's quality heuristic rewards linguistic sophistication; the graded reader requirement punishes it. In a 500-word passage, a single model call will typically violate 3–6 grammar level rules even with detailed system prompt constraints.
3. Structural Inconsistency Across Passages
When generating a chapter at a time, the model does not track which new headwords it has already introduced. Chapter 3 may introduce "shiver" as if new even though it appeared in Chapter 1. There is no built-in recycling logic.
4. Proper Noun and Cultural Reference Contamination
Models default to culturally specific references that require significant background knowledge: brand names, idioms rooted in cultural events, sports metaphors from one culture, etc. A model told to write about "a family in difficulty" may produce references to "filing for Chapter 11" or "calling 911" that are invisible failures from a vocabulary standpoint but serious cultural access failures.
5. Prompt Degradation at Scale
A meticulously crafted system prompt that works well for a single paragraph loses effectiveness over longer generations: the further into the text, the less weight the vocabulary constraints carry. This is a context management limitation rather than a capability limitation. It is consistent with the reported finding that prompting alone achieves ~40% comprehensibility for A1 targets (Yang & Klein 2021 / FUDGE research), versus ~83% when a discriminator steers generation.
Constraint Techniques: What Actually Works
Level 1: System Prompt Engineering (Necessary but Insufficient)
A well-structured system prompt should:
- State the target headword level explicitly ("Write using only the most frequent 700 English word families")
- Provide a short positive and negative example ("Use 'tired' not 'exhausted', 'big' not 'enormous'")
- Specify forbidden grammar structures explicitly ("Do not use: past perfect, relative clauses, modals other than can/will/must")
- State approximate sentence length ("Keep sentences under 14 words on average")
This alone will reduce violations but will not eliminate them. Research consistently shows prompting achieves ~40–50% compliance for strict low-level constraints without additional mechanisms.
Level 2: Post-Generation Vocabulary Profiling + Flagging
Run every generated passage through an automated profiler immediately after generation. Flag:
- Off-list lemmas (headword violations)
- Sentences exceeding the length target
- Forbidden grammatical constructions (via dependency parse)
Return the flagged items to the LLM with an edit pass prompt: "Revise the following passage. Replace the highlighted words with simpler alternatives from the list below. Do not change plot or character. Maintain sentence length under 14 words."
This edit-pass approach outperforms single-shot generation for compliance. Two or three passes typically converge to acceptable coverage.
Level 3: Iterative Feedback Loops
A full loop:
generate → profile → flag violations → revise → re-profile → accept/reject
Key design decisions:
- Define an acceptance threshold (e.g., ≥97.5% headword coverage, zero grammar violations above level)
- Set a maximum number of revision passes (3 is typically sufficient; more passes produce diminishing returns and can destabilise the text)
- Keep revision prompts surgical: "replace word X with Y" rather than "rewrite the whole passage", which introduces new violations
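The loop and its design decisions can be sketched as a small driver function. Everything named here is hypothetical: `revise_until_accepted` is an illustrative name, and `toy_profile` and `toy_revise` are stubs standing in for a real profiler and a real LLM edit call.

```python
from typing import Callable

def revise_until_accepted(
    passage: str,
    profile: Callable[[str], dict],
    revise: Callable[[str, list[str]], str],
    threshold: float = 97.5,
    max_passes: int = 3,
) -> dict:
    """Run profile -> revise cycles until coverage meets the threshold
    or the pass budget is exhausted."""
    passes = 0
    report = profile(passage)
    while report["coverage_pct"] < threshold and passes < max_passes:
        passage = revise(passage, report["off_list_types"])
        passes += 1
        report = profile(passage)
    status = "accepted" if report["coverage_pct"] >= threshold else "rejected"
    return {"passage": passage, "passes": passes, "status": status, **report}

# Stubs for demonstration only: a word-set profiler and a lookup-table "LLM".
TARGET = {"she", "was", "very", "tired", "and", "sad"}
SIMPLER = {"exhausted": "very tired", "miserable": "sad"}

def toy_profile(text: str) -> dict:
    words = text.lower().split()
    off = sorted({w for w in words if w not in TARGET})
    covered = sum(w in TARGET for w in words)
    return {"coverage_pct": 100 * covered / len(words), "off_list_types": off}

def toy_revise(text: str, flagged: list[str]) -> str:
    # Surgical edit: replace only the first flagged word on each pass.
    return " ".join(SIMPLER.get(w, w) if w == flagged[0] else w
                    for w in text.split())

result = revise_until_accepted("she was exhausted and miserable",
                               toy_profile, toy_revise)
print(result["status"], result["passes"], result["passage"])
```

Passing the profiler and reviser in as callables keeps the loop testable without an API key, and the explicit `max_passes` makes the reject-and-log branch reachable instead of looping indefinitely.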
Level 4: Constrained Decoding (Advanced)
For developers working with open-weight models (LLaMA, Mistral, etc.), constrained decoding directly manipulates the probability distribution at the token level. Techniques:
Logit bias: set the logit score of off-list tokens to −∞ before sampling, preventing the model from ever generating them. This is implementable in most inference frameworks. The challenge is that LLM tokenisers use subword tokens (BPE), so "exhausted" may tokenise as "exha" + "usted"; the constraint must operate at the word boundary level, not raw token level.
FUDGE (Future Discriminators for Generation), Yang & Klein (2021, NAACL): A modular controlled generation method. A separate "attribute discriminator" model is trained to predict whether a partial sequence will satisfy a target attribute (e.g., CEFR A2 level). At each generation step, the discriminator adjusts the base LLM's output probabilities to favour in-attribute continuations. FUDGE does not require access to model weights; it works externally on logits. Applied to language learning, FUDGE improved comprehensibility from 39.4% (prompting only) to 83.3% (Yang applied this to conversational difficulty control).
Outlines / XGrammar: Libraries for FSM-based constrained decoding. More useful for enforcing structured output formats (JSON, grammar rules) than lexical vocabulary lists, but applicable with creativity.
Grammar/regex-constrained decoding: Can enforce sentence length caps and basic structure rules at generation time. Experimental for ELT use but technically feasible.
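The word-boundary problem noted for logit bias can be illustrated with a toy vocabulary. The sketch below is a simplified assumption-laden example, not code for any particular inference framework: logits are a plain dictionary, subword tokens are bare strings without whitespace markers, and inflected forms and end-of-word handling are ignored.

```python
import math

def mask_off_list(logits: dict[str, float], partial_word: str,
                  allowed_words: set[str]) -> dict[str, float]:
    """Set to -inf every subword token that cannot extend `partial_word`
    into a prefix of some allowed word."""
    def extends(token: str) -> bool:
        candidate = partial_word + token
        return any(word.startswith(candidate) for word in allowed_words)

    return {token: (score if extends(token) else -math.inf)
            for token, score in logits.items()}

allowed = {"tired", "big"}
logits = {"exha": 2.5, "ti": 1.0, "red": 0.3, "usted": 0.1, "big": 0.2}

at_word_start = mask_off_list(logits, "", allowed)    # nothing generated yet
after_ti = mask_off_list(logits, "ti", allowed)       # "ti" already emitted

print(at_word_start)
print(after_ti)
```

The token "red" is blocked at the start of a word but permitted after "ti", which is why the mask has to be recomputed from the current partial word at every step rather than fixed once per vocabulary.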
Level 5: Fine-Tuned CEFR-Aligned Models
The research frontier. The paper "From Tarzan to Tolkien" (Malik et al., 2024, ACL Findings) evaluated few-shot prompting, supervised fine-tuning (SFT), and reinforcement learning (RL) for CEFR-controlled generation. Key findings:
- GPT-4 with prompting significantly outperforms smaller open-source models with prompting
- Fine-tuning + RL ("CALM," CEFR-Aligned Language Model) surpasses GPT-4 at a fraction of the cost
- The gap between GPT-4 and open-source models closes substantially with fine-tuning
- RL alignment specifically helps with low CEFR levels (A1–A2) where constraint is strictest
A companion 2025 Springer paper ("Automatic generation of ESL learning materials based on CEFR levels using reinforcement-tuned LLMs") added multi-objective reward shaping and constrained decoding, achieving 12.3% improvement in CEFR classification accuracy over state-of-the-art baselines, with 15.6% reduction in misalignment errors.
A Realistic Python Pipeline
A working pipeline for a graded reader author who codes in Python.
Architecture
[Input: story outline + headword level]
→ [Generate passage: LLM with system prompt]
→ [Profile: spaCy lemmatiser + frequency-band lookup]
→ [Flag: off-list words + grammar violations]
→ [Decision: pass / revise]
→ [Revise: LLM edit pass with flagged items]
→ [Re-profile]
→ [Accept if ≥ threshold, else reject + log]
→ [Output: validated passage]
→ [Append to document, track new headwords introduced]
Implementation Notes
Vocabulary profiling function (core logic):
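import spacy

# Small English model; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def profile_text(text: str, target_list: set[str]) -> dict:
    """Profile a passage against a set of lowercase headword lemmas."""
    doc = nlp(text)
    # Count only real words: skip punctuation and whitespace tokens.
    tokens = [t for t in doc if not t.is_punct and not t.is_space]
    total = len(tokens)
    off_list = [t for t in tokens if t.lemma_.lower() not in target_list]
    n_sents = len(list(doc.sents))
    return {
        "total_tokens": total,
        "off_list_tokens": len(off_list),
        # Guard against empty input to avoid division by zero.
        "coverage_pct": (total - len(off_list)) / total * 100 if total else 0.0,
        "off_list_types": sorted({t.lemma_.lower() for t in off_list}),
        "mean_sentence_length": total / n_sents if n_sents else 0.0,
    }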
Target list construction: Load a headword list (e.g., the NGSL-GR band file as a plain text list) into a Python set. Proper nouns such as character and place names should be excluded from the coverage calculation, because they are not counted against the headword list; no other off-list words get this free pass. Use spaCy's NER (t.ent_type_) to identify and exclude them; the part-of-speech tag (t.pos_ == "PROPN") is a useful cross-check.
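Loading the list might look like the sketch below. The file layout is an assumption (one headword per line, with optional `#` comment lines); published band files may use other layouts and would need a different parser. `io.StringIO` stands in for a real file handle.

```python
import io

def load_headwords(lines) -> set[str]:
    """Build a lowercase headword set from an iterable of text lines,
    skipping blank lines and lines that start with '#'."""
    headwords = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            headwords.add(word)
    return headwords

# Stand-in for: open("headwords.txt", encoding="utf-8")  (hypothetical path)
sample_file = io.StringIO("# band 1\nThe\nbe\n\nwalk\nwalk\n")
target_list = load_headwords(sample_file)
print(sorted(target_list))
```

Lowercasing at load time matters because the profiler compares against `lemma_.lower()`; a set also removes duplicate entries and gives constant-time lookups.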
Grammar check (simple version): Count sentences exceeding max length. For subordinate clause detection, use spaCy dependency labels: any token with dep_ in {"relcl", "advcl", "csubj", "xcomp"} is a subordinate clause signal.
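A simple version of this check can be written against (token, dependency label) pairs. In the sketch below the parses are hand-labelled for illustration; in a real run they would come from spaCy, for example `[(t.text, t.dep_) for t in sent if not t.is_punct]`. The passive labels are included as an optional extra flag, and the function name is invented.

```python
SUBORDINATE_DEPS = {"relcl", "advcl", "csubj", "xcomp"}
PASSIVE_DEPS = {"nsubjpass", "auxpass"}

def grammar_flags(sentences: list[list[tuple[str, str]]],
                  max_len: int = 14) -> dict:
    """Return the indices of sentences that are too long, contain a
    subordinate-clause label, or contain a passive label."""
    flags = {"too_long": [], "subordinate": [], "passive": []}
    for idx, sent in enumerate(sentences):
        deps = {dep for _, dep in sent}
        if len(sent) > max_len:
            flags["too_long"].append(idx)
        if deps & SUBORDINATE_DEPS:
            flags["subordinate"].append(idx)
        if deps & PASSIVE_DEPS:
            flags["passive"].append(idx)
    return flags

# Hand-labelled example parses (punctuation omitted).
parsed = [
    [("She", "nsubj"), ("ran", "ROOT"), ("home", "advmod")],
    [("The", "det"), ("door", "nsubjpass"), ("was", "auxpass"),
     ("opened", "ROOT")],
    [("He", "nsubj"), ("left", "ROOT"), ("because", "mark"),
     ("it", "nsubj"), ("rained", "advcl")],
]
flags = grammar_flags(parsed, max_len=4)
print(flags)
```

Note that `xcomp` also fires on common patterns such as "want to go", so at levels where infinitive complements are allowed it may be worth dropping that label from the set.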
Revision prompt (edit pass):
You are editing a graded reader passage at [LEVEL] (approx [N] headwords).
The following words in the passage are too advanced. Replace each with a simpler
alternative using only common, everyday English.
Words to replace:
[LIST OF OFF-LIST WORDS WITH SENTENCE CONTEXT]
Rules:
- Do not change character names or place names
- Keep the same meaning and plot
- Keep sentences under [MAX_LENGTH] words
- Do not introduce new grammar structures beyond [ALLOWED_STRUCTURES]
Return only the revised passage.
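Filling this template programmatically might look like the following sketch. The placeholder names, the `build_revision_prompt` helper, and the trailing `Passage:` section (needed so the model actually sees the text) are assumptions added for illustration.

```python
REVISION_TEMPLATE = """You are editing a graded reader passage at {level} (approx {n} headwords).
The following words in the passage are too advanced. Replace each with a simpler
alternative using only common, everyday English.
Words to replace:
{flagged}
Rules:
- Do not change character names or place names
- Keep the same meaning and plot
- Keep sentences under {max_length} words
- Do not introduce new grammar structures beyond {allowed}
Return only the revised passage.

Passage:
{passage}"""

def build_revision_prompt(passage: str, flagged: dict[str, str], level: str,
                          n: int, max_length: int, allowed: str) -> str:
    """`flagged` maps each off-list word to the sentence it occurs in."""
    lines = [f'- "{word}" in: {sentence}' for word, sentence in flagged.items()]
    return REVISION_TEMPLATE.format(
        level=level, n=n, flagged="\n".join(lines),
        max_length=max_length, allowed=allowed, passage=passage,
    )

prompt = build_revision_prompt(
    passage="Anna was exhausted. She went home.",
    flagged={"exhausted": "Anna was exhausted."},
    level="Stage 2", n=700, max_length=14,
    allowed="present simple, past simple, can/will/must",
)
print(prompt)
```

Supplying each flagged word with its sentence gives the model enough context to pick a substitute that fits, while keeping the edit narrow.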
Tracking new headwords: Maintain a running set of "introduced headwords" per document. Each new headword should appear in a high-support context on first occurrence. Flag headwords that appear only once (under-recycled) at the end of a chapter.
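One way to keep this running set is a small tracker object. The sketch below is illustrative: the class name and the idea of a `baseline` set (headwords assumed known before the book starts) are assumptions, and input is expected to be already lemmatised.

```python
from collections import Counter

class HeadwordTracker:
    """Tracks headwords a document introduces beyond an assumed baseline."""

    def __init__(self, baseline: set[str]):
        self.baseline = baseline      # headwords treated as already known
        self.counts = Counter()       # occurrences of each introduced headword
        self.first_seen = {}          # headword -> chapter of first use

    def add_chapter(self, chapter: int, lemmas: list[str]) -> list[str]:
        """Record a chapter; return headwords appearing for the first time."""
        new = []
        for lemma in lemmas:
            if lemma in self.baseline:
                continue
            if lemma not in self.first_seen:
                self.first_seen[lemma] = chapter
                new.append(lemma)
            self.counts[lemma] += 1
        return new

    def under_recycled(self) -> list[str]:
        """Introduced headwords that have occurred only once so far."""
        return sorted(w for w, c in self.counts.items() if c == 1)

tracker = HeadwordTracker(baseline={"she", "be", "cold"})
ch1_new = tracker.add_chapter(1, ["she", "shiver", "be", "cold", "shiver"])
ch3_new = tracker.add_chapter(3, ["she", "shiver", "blanket"])
print(ch1_new, ch3_new, tracker.under_recycled())
```

Because `first_seen` persists across chapters, "shiver" is reported as new only in Chapter 1, and the list returned by `add_chapter` can be fed into the next generation prompt so the model knows which words are due for recycling.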
Readability scoring: Use spacy-readability or textdescriptives to add FK grade level and reading ease to each passage automatically as part of the profile run.
What to Log per Passage
passage_id | word_count | coverage_pct | off_list_types |
mean_sent_length | fk_grade | flesch_ease |
revision_passes | acceptance_status | new_headwords_introduced
This log becomes the quality audit trail and informs cumulative recycling tracking across chapters.
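A lightweight way to keep this log is one dataclass per passage written out as CSV. The sketch below mirrors the field names listed above; the types, the `;`-joined string convention for word lists, and the sample values are invented for illustration, and `io.StringIO` stands in for a real file.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass
class PassageLog:
    passage_id: str
    word_count: int
    coverage_pct: float
    off_list_types: str            # assumed convention: ";"-joined lemmas
    mean_sent_length: float
    fk_grade: float
    flesch_ease: float
    revision_passes: int
    acceptance_status: str
    new_headwords_introduced: str  # assumed convention: ";"-joined lemmas

def write_log(rows: list[PassageLog], handle) -> None:
    """Write a header row followed by one CSV row per passage."""
    writer = csv.DictWriter(handle,
                            fieldnames=[f.name for f in fields(PassageLog)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))

buffer = io.StringIO()  # stand-in for an open CSV file
write_log([PassageLog("ch1_p1", 212, 98.1, "shiver", 9.6, 2.8, 88.4,
                      1, "accepted", "shiver")], buffer)
log_text = buffer.getvalue()
print(log_text)
```

A flat CSV loads directly into pandas or polars for filtering, for example to list every passage that needed three revision passes.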
Controllable Text Generation: Academic Background
"Controllable text generation" is the NLP research area covering generation of text constrained to specified attributes (sentiment, style, readability, topic, vocabulary level). Relevant to graded reader production:
Lexical Simplification: replacing complex words with simpler alternatives while preserving meaning. Well-studied in NLP (CWI: Complex Word Identification + LS: Lexical Substitution pipeline). Used in text simplification for accessibility (plain language) and for EFL materials. Tools like BERT-based substitution models can suggest in-band alternatives for off-list words.
Text Simplification (automatic): broader than lexical; includes sentence splitting, grammar simplification, discourse restructuring. The ASSET and Newsela datasets provide simplified English training data. Less directly applicable to graded reader production (which requires narrative coherence) but useful for factual/non-fiction graded readers.
The CALM paper (Malik et al., 2024): specifically addresses the graded reader-adjacent problem of controlling proficiency level. CEFR alignment is treated as the target attribute; the model is fine-tuned with RL to optimise for it. The paper demonstrates that this is a tractable NLP task with current models.
Practical Recommendations for Q's Use Case
Given you are building IELTS programme materials (structured non-fiction/academic English) rather than pure fiction graded readers, a few specific notes:
- Use TextInspector as the primary QA tool for CEFR-profiling passages. Its English Vocabulary Profile integration is more precise than frequency-band-only tools for determining whether academic vocabulary is level-appropriate.
- The NGSL-GR is the right list for fiction-adjacent content. For IELTS academic content, pair it with the AWL (Academic Word List) and profile against both.
- Set acceptance threshold at 97% for IELTS materials rather than 98%; IELTS learners are explicitly preparing to encounter unknown vocabulary in context, so a slightly lower coverage threshold is pedagogically defensible.
- Build the Python profiler first before scaling generation. Without fast automated profiling you cannot know whether your prompting strategy is working. One afternoon with spaCy + the NGSL-GR word list gives you a tool you can reuse on every passage.
- Use the edit-pass architecture (generate → profile → flag → revise) rather than trying to constrain generation perfectly on the first pass. This is more reliable with current API-based models than constrained decoding (which requires access to logits).
- FUDGE is worth understanding even if you do not implement it. The finding that prompting alone achieves ~40% compliance for A1/A2 targets is a useful calibration for expectations and justifies the post-processing investment.
Key References
- Yang, K. & Klein, D. (2021). FUDGE: Controlled Text Generation With Future Discriminators. NAACL 2021. arXiv:2104.05218. GitHub: yangkevin2/naacl-2021-fudge-controlled-generation
- Malik, A., Mayhew, S., Piech, C. & Bicknell, K. (2024). From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation. ACL 2024 Findings. arXiv:2406.03030
- [Anonymous] (2025). Automatic generation of ESL learning materials based on CEFR levels using reinforcement-tuned LLMs. Discover Artificial Intelligence, Springer Nature. doi:10.1007/s44163-025-00762-3
- Claridge, G. (2012). Graded readers: How the publishers make the grade. Reading in a Foreign Language, 24(1).
- Nation, I.S.P. (2001). Learning Vocabulary in Another Language. Cambridge University Press.
- Hu, M. & Nation, I.S.P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1).
- Liu, H. (2008). Dependency distance as a metric of language comprehension difficulty. Journal of Cognitive Science, 9(2), 159–191.
See Also
- Graded Reader Construction: the craft side: vocabulary systems, grammar syllabuses, publisher philosophies
- Graded Reader: overview and series comparison tables
- Word Families: Nation's word family levels
- BNC COCA Headword Lists (2K 3K 4K): frequency lists used in profiling
- AI in Language Teaching: broader AI/ELT overview
- Comprehensible Input: the i+1 principle the pipeline operationalises