Reading Comprehension Test Design
Reading comprehension test design is the principled construction of passage-and-question item sets that measure a defined reading construct. The craft sits at the intersection of applied linguistics, psychometrics, and cognitive psychology, and operates across three coupled surfaces: the passage, the question, and the distractor. Each surface has a distinct research literature and a distinct set of failure modes; quality emerges only when the three are designed jointly against an explicit specification grounded in a target language use (TLU) domain.
The passage
Passage choice sets a possibility envelope: it determines what an item can plausibly measure. Genre, topic familiarity, and linguistic complexity are the three principal levers, and they interact.
Genre research consistently finds expository and argumentative texts harder than narrative texts for adolescent and adult readers. Unfamiliar rhetorical structures, an abstract noun-heavy register, and the absence of story-grammar scaffolding all raise processing load. The practical consequence is that narrative passages are fairer at lower CEFR bands while expository passages discriminate better at higher ones. A test built on narrative alone is unlikely to separate B2 from C1 readers reliably.
Topic familiarity is the most under-managed confound. When candidates know a topic, they can answer correctly even where they cannot fully decode the text, so background knowledge inflates the score in ways unrelated to reading ability. Professional test writers select obscure-but-accessible topics: novel enough that few candidates bring a prior schema, generic enough that the linguistic surface is the only path to the answer. This matters acutely in EFL contexts, where culturally familiar topics systematically advantage some candidates.
Text Complexity is operationalised through readability indices such as Flesch-Kincaid, cohesion metrics such as those reported by Coh-Metrix, and CEFR-calibrated scales. The headline research finding is that text difficulty alone does not predict item difficulty: a simple passage with a buried inference question can be harder than a complex passage with a literal one. Text and task must be considered jointly, which is what a proper test specification forces.
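As an illustration of what a readability index captures, and what it leaves out, here is a minimal Python sketch of the Flesch-Kincaid grade level. The formula coefficients are the standard published ones. The syllable counter is a rough vowel-group heuristic assumed here for brevity, and the function names are illustrative rather than taken from any particular library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, dropping a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Usage: the index sees only sentence length and word length.
passage = "The cat sat on the mat. It was warm and quiet in the room."
print(round(flesch_kincaid_grade(passage), 2))
```

The index is computed from the passage alone, so two items built on the same passage receive the same score whether the question is literal or inferential. This is the text-versus-task gap described above.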
The question
Question taxonomies fall into two traditions.
The cognitive tradition runs through Bloom's Taxonomy, in its revised form (remember → understand → apply → analyse → evaluate → create). It dominates EFL textbook analysis, where studies consistently show questions clustered at the lower three levels, leaving higher-order comprehension (analyse, evaluate, create) largely untested.
The reading-specific tradition runs through Barrett's Taxonomy (literal / reorganisation / inferential / evaluation / appreciation; Clymer 1968) and its descendants. Christine Nuttall's Teaching Reading Skills in a Foreign Language (Heinemann 1982; rev. 1996; Macmillan 2005) is the most influential ELT-side text in this lineage. Day & Park (2005) consolidated the field into a six-type framework (literal, reorganisation, inference, prediction, evaluation, personal response), explicitly building on Pearson & Johnson (1972) and Nuttall (1996). For high-stakes test analysis, a simpler three-level reading of Nuttall (literal / reinterpretation / inference) is widely used. A study of Barron's IELTS preparation tests using that frame found 43.8% literal, 43.3% reinterpretation, and only 12.9% inference (Khabbazbashi & Galaczi, cited in subsequent IELTS analyses). The thinness of the inference layer is typical of commercial materials and reflects the production reality that inference items are the hardest to write well.
For an item bank intended to discriminate at upper proficiency, the inference proportion is the cleanest single quality signal.
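The inference proportion can be audited mechanically once each item carries a level label. The sketch below assumes items are already tagged with the three-level reading of Nuttall. The function names, the sample bank, and the 0.25 threshold are hypothetical choices for illustration, not published standards.

```python
from collections import Counter

# Three-level reading of Nuttall, as used in the analysis described above.
LEVELS = ("literal", "reinterpretation", "inference")


def level_proportions(labels):
    """Return the share of items at each level; reject labels outside the frame."""
    counts = Counter(labels)
    unknown = sorted(set(counts) - set(LEVELS))
    if unknown:
        raise ValueError(f"unknown level labels: {unknown}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("item bank is empty")
    return {level: counts[level] / total for level in LEVELS}


def meets_inference_target(labels, min_inference=0.25):
    """min_inference is an illustrative threshold (an assumption), not a standard."""
    return level_proportions(labels)["inference"] >= min_inference


# Usage with a made-up bank of 16 labelled items.
bank = ["literal"] * 7 + ["reinterpretation"] * 7 + ["inference"] * 2
print(level_proportions(bank))
print(meets_inference_target(bank))
```

The tally is only as good as the labels. Assigning a level to each item still requires a trained rater or a validated classifier.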
The distractor
The distractor bank is the discrimination engine. Item Difficulty and Item Discrimination are driven more by distractor quality than by stem or passage. The bedrock reference is Haladyna, Downing & Rodriguez (2002), A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment, a 31-rule taxonomy synthesising 27 textbooks on educational testing with 27 research studies and reviews published since 1990. Their headline guidelines: state one clear problem in the stem, write plausible-but-unambiguously-wrong distractors, avoid negative wording, and keep stem and options grammatically consistent. Their review also speaks to the long-running options-count debate, concluding that three options are sufficient in most cases. Fourth and fifth options rarely attract enough candidates to add anything to discrimination.
Plausibility is the operational concept. A plausible distractor, one designed to attract a candidate with partial understanding while remaining demonstrably wrong against the text, is what makes an item discriminating. The design rule is that the distractor writer must understand the specific reasoning errors a candidate at the target proficiency would make, then build the option around that error. The 5Ps Distractor Typology (Sun, Yang & Liu 2026) gives item writers a finer vocabulary for what kind of trap a distractor is setting: plausible, peripheral, polyconceptual, prejudicial, and pragmatic. The NLP literature on automatic distractor generation converges on the same properties from a different angle: a good distractor is semantically related to the answer, consistent with the passage context, grammatically clean, and traceable back to the text.
How the three surfaces interact
The passage sets a possibility envelope; the question targets a specific sub-skill within it; the distractor controls discrimination. Failures show up as mismatches: a passage too simple to support the question's claimed cognitive level, a question whose key is recoverable from background knowledge alone, a distractor so implausible that the item degrades to a 3-choice or 2-choice item. Item Analysis catches the third failure after the fact; a TLU-grounded specification is what prevents the first two before items are written.
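A minimal sketch of the post-administration check might look like the following. It computes item facility, an upper-lower discrimination index, and per-option selection shares, then flags distractors that almost nobody chose. The 27% group size and the 5% cut-off are common rules of thumb assumed here, and the function name and data layout are hypothetical.

```python
from collections import Counter


def analyse_item(responses, key, options, group_share=0.27, min_share=0.05):
    """Classical item analysis for one multiple-choice item.

    responses: list of (chosen_option, total_test_score) pairs, one per candidate.
    group_share and min_share are conventional rules of thumb (assumptions).
    """
    n = len(responses)
    if n == 0:
        raise ValueError("no responses to analyse")

    # Rank candidates by total score and take the top and bottom groups.
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    g = max(1, round(n * group_share))
    upper, lower = ranked[:g], ranked[-g:]

    def p_correct(group):
        return sum(1 for choice, _ in group if choice == key) / len(group)

    shares = Counter(choice for choice, _ in responses)
    return {
        "facility": p_correct(responses),
        "discrimination": p_correct(upper) - p_correct(lower),
        "option_share": {opt: shares[opt] / n for opt in options},
        # Distractors chosen by fewer than min_share of candidates do no work.
        "non_functioning": [
            opt for opt in options if opt != key and shares[opt] / n < min_share
        ],
    }


# Usage with made-up data: 20 candidates with scores 20 down to 1, key is "A".
choices = ["A"] * 10 + ["B"] * 6 + ["C"] * 4
data = list(zip(choices, range(20, 0, -1)))
print(analyse_item(data, key="A", options=["A", "B", "C", "D"]))
```

In this made-up data set option D attracts no candidate, so the nominal four-option item is functioning as a three-option item, which is the third failure described above.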
Implications for AI-assisted item generation
Three points carry over directly into pipeline design. First, item specifications must be written before items, with the construct, sub-skill, and TLU domain pinned formally so the generator has somewhere to aim. Second, distractor authoring deserves more attention than question authoring. A strong stem with three weak distractors is a worse item than a mediocre stem with three strong ones, and the bottleneck in AI generation is almost always distractor plausibility rather than stem fluency. Third, inference-level item proportions are the cleanest single quality signal: a generator that cannot reliably produce well-formed inference items is producing a test that will not discriminate at the upper end.
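The first point, specification before items, can be enforced structurally by refusing to hand the generator anything that lacks a complete specification. The sketch below is one possible shape, not an established schema: the class name, field names, allowed values, and example content are all assumptions made for illustration.

```python
from dataclasses import dataclass

CEFR_BANDS = ("A1", "A2", "B1", "B2", "C1", "C2")
QUESTION_LEVELS = ("literal", "reinterpretation", "inference")


@dataclass(frozen=True)
class ItemSpec:
    """Hypothetical item specification pinned before any item is generated."""
    construct: str        # what the item set claims to measure
    sub_skill: str        # the reading sub-skill the stem targets
    tlu_domain: str       # the target language use domain it is grounded in
    genre: str            # e.g. narrative, expository, argumentative
    cefr_band: str
    question_level: str
    n_options: int = 3    # key plus two distractors

    def __post_init__(self):
        for name in ("construct", "sub_skill", "tlu_domain", "genre"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be specified before generation")
        if self.cefr_band not in CEFR_BANDS:
            raise ValueError(f"unknown CEFR band: {self.cefr_band}")
        if self.question_level not in QUESTION_LEVELS:
            raise ValueError(f"unknown question level: {self.question_level}")
        if self.n_options < 3:
            raise ValueError("an item needs a key and at least two distractors")


# Usage: the specification is validated before any generator sees it.
spec = ItemSpec(
    construct="academic reading comprehension",
    sub_skill="inferring the writer's attitude",
    tlu_domain="undergraduate study in English",
    genre="argumentative",
    cefr_band="C1",
    question_level="inference",
)
print(spec)
```

Making the specification immutable (`frozen=True`) is a design choice: the target stays fixed while items and distractors are drafted and revised against it.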
Key References
- Alderson, J. C. (2000). Assessing Reading. Cambridge University Press.
- Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford University Press.
- Clymer, T. (1968). What is "reading"?: Some current concepts. In H. M. Robinson (ed.), Innovations and Change in Reading Instruction, 67th Yearbook of the National Society for the Study of Education, Part II. University of Chicago Press.
- Day, R. R. & Park, J. (2005). Developing reading comprehension questions. Reading in a Foreign Language, 17(1), 60–73.
- Grellet, F. (1981). Developing Reading Skills: A Practical Guide to Reading Comprehension Exercises. Cambridge University Press.
- Haladyna, T. M., Downing, S. M. & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–334.
- Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.
- Nuttall, C. (1996). Teaching Reading Skills in a Foreign Language (rev. ed.). Heinemann.
- Pearson, P. D. & Johnson, D. D. (1972). Teaching Reading Comprehension. Holt, Rinehart and Winston.
- Sun, Y., Yang, Y. & Liu, X. (2026). Proposing the 5Ps typology of distractors for EFL multiple-choice reading comprehension tests. Higher Education Studies, 16(1).
See Also
- Listening Comprehension Test Design: the spoken-input counterpart, sharing the test-design apparatus but adding real-time processing and modality-specific failure modes
- Test Specifications: the upstream document that disciplines passage, question, and distractor choices
- Target Language Use Domain: Bachman & Palmer's framework for grounding specifications in real-world use
- Distractor: the discrimination engine of MCQ-format reading items
- Reading Subskills: the inventory of sub-skills that question stems target
- Item Analysis: post-administration evidence that closes the design loop