ELTiverse

Validity

Validity is the degree to which a test measures what it claims to measure. It is the most important quality of any assessment — a test that does not measure what it purports to is useless, regardless of how reliable or practical it is.

Modern testing theory (Messick 1989) treats validity as a unitary concept: there is not a set of separate "types" but rather different sources of evidence that support or undermine the overall validity argument. However, the traditional categories remain useful for thinking through test quality.

Sources of Validity Evidence

Construct Validity

Does the test actually measure the theoretical construct it targets? A "reading comprehension" test that can be answered from general knowledge without reading the passage has weak construct validity. A speaking test where learners read aloud from a script does not measure spontaneous speaking ability.

Construct validity is the overarching concern — all other "types" contribute evidence for or against it.

Content Validity

Does the test adequately sample the content domain it claims to cover? A listening test that only uses monologues and never dialogues under-represents the construct of listening ability. Content validity is established through expert judgment and specification matching, not statistics.

Face Validity

Does the test look like a legitimate test of what it claims to measure, to the people taking it? Face validity is technically not a "real" type of validity, but it matters pragmatically. A speaking test that feels like a speaking test motivates engagement; one that feels irrelevant breeds resentment and reduced effort.

Criterion-Related Validity

Does test performance correlate with other measures of the same ability? Two sub-types are usually distinguished:

  • Concurrent validity: Does it correlate with existing accepted tests? (e.g., a new placement test vs. IELTS scores)
  • Predictive validity: Does it predict future performance? (e.g., do placement test scores predict end-of-course achievement?)
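Criterion-related evidence is typically reported as a correlation coefficient. A minimal sketch in Python of how concurrent validity evidence might be computed, comparing a new placement test against an established benchmark (all scores below are invented for illustration):

```python
# Pearson correlation between a new placement test and an established
# benchmark measure, as a rough index of concurrent validity.
# All scores are invented for illustration.
from math import sqrt

placement = [42, 55, 61, 48, 70, 66, 53, 59]           # hypothetical new test
benchmark = [5.0, 6.0, 6.5, 5.5, 7.5, 7.0, 5.5, 6.5]   # e.g. IELTS-style bands

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r(placement, benchmark)
print(f"r = {r:.2f}")  # r ≈ 0.98 for these invented data
```

A high coefficient supports a concurrent validity claim but does not settle it: both tests could correlate strongly while sharing the same construct-irrelevant variance.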

The Validity-Reliability Relationship

Reliability is a necessary but not sufficient condition for validity. A test can be perfectly reliable (consistent results every time) but completely invalid (consistently measuring the wrong thing). A grammar test used as a measure of speaking ability may be highly reliable but invalid for its stated purpose.

Conversely, a test cannot be valid if it is not reliable — inconsistent measurement cannot be accurate measurement.

Threats to Validity

  • Construct underrepresentation: A writing test that only assesses grammar, ignoring coherence, task achievement, and vocabulary
  • Construct-irrelevant variance: A reading test where scores depend heavily on background knowledge of the topic rather than on reading ability
  • Method effects: A multiple-choice format that rewards test-wise strategies unrelated to language ability
  • Bias: Test content that advantages certain cultural or gender groups
  • Washback misalignment: A test that measures skills that do not align with the learning objectives

Why It Matters

  • Every assessment decision should start with: What am I trying to measure, and does this test actually measure it?
  • Validity is an argument, not a number. You build a case for validity by accumulating evidence and addressing threats.
  • In classroom contexts, validity often comes down to: Does this test reflect what I actually taught and what learners actually need to be able to do?
  • High-stakes decisions (placement, certification, graduation) demand stronger validity evidence than low-stakes classroom quizzes.
