Consequential Validity

Consequential validity concerns the social consequences of test use — both intended and unintended. It asks not just "Does this test measure what it claims to measure?" but "What happens as a result of using this test? Who benefits? Who is harmed?"

The concept stems from Samuel Messick's (1989) unified theory of validity. Messick argued that validity is not a property of the test itself but of the interpretations and uses of test scores. On this view, consequences are an integral part of the validity argument, not an afterthought.

Messick's Validity Framework

Messick (1989) proposed a 2×2 matrix:

	Evidential basis	Consequential basis
Test interpretation	Construct validity	Value implications
Test use	Construct validity + relevance/utility	Social consequences

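Traditional validation focuses on the evidential basis: does the test measure the right construct, and are the scores relevant and useful for the intended purpose? Consequential validity extends the inquiry to the consequential basis, the right-hand column of the matrix: what values does the test embody, and what effects does its use produce?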

What Counts as a Consequence?

Intended consequences

Students who reach the required proficiency level gain access to university education
Teachers focus on communicative skills because the test rewards them
Institutions identify students who need additional support

Unintended consequences

A writing test that rewards formulaic structure produces learners who write to templates rather than thinking critically
A high-stakes test creates an industry of expensive preparation courses that disadvantage poorer candidates
A placement test systematically underplaces students from certain L1 backgrounds due to cultural bias in task topics
Test anxiety undermines the performance of otherwise capable candidates

The Debate

Consequential validity is among the most contested aspects of Messick's framework. Critics (e.g., Popham, 1997; Mehrens, 1997) argue that:

Consequences are a matter of test use, not test validity — a valid test can be misused
Holding test developers responsible for all social consequences is unreasonable
The concept blurs the distinction between technical quality and political/ethical judgment

Defenders (including Messick himself) counter that:

Validity has always been about the appropriateness of inferences and actions based on scores
Ignoring consequences allows harmful tests to hide behind technical adequacy
Test developers who publish a test bear some responsibility for foreseeable effects of its use

Implications for Language Testing

Context	Consequential concern
IELTS for immigration	A visa refusal can hinge on a 0.5 band difference: is the measurement precise enough to support that decision?
National exit exams	Does the test narrow the curriculum? Does preparation access vary by socioeconomic status?
Placement testing	Does the test systematically place certain groups lower, limiting their access to higher-level instruction?
Washback	Does the test shape teaching in positive or negative ways?
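
The precision question in the IELTS row can be made concrete with the standard error of measurement (SEM) from classical test theory, SEM = SD × sqrt(1 − reliability). The sketch below is illustrative only: the standard deviation, reliability, observed score, and cut score are hypothetical values chosen for the example, not published figures for IELTS or any other test, and the function names are invented for this sketch.

```python
import math
from typing import Tuple


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% band around an observed score (normal approximation)."""
    return (observed - z * sem, observed + z * sem)


# Hypothetical inputs (assumptions for illustration, not real test statistics).
sem = standard_error_of_measurement(sd=1.0, reliability=0.91)
low, high = score_band(observed=6.0, sem=sem)

cut_score = 6.5  # hypothetical required band
cut_inside_band = low <= cut_score <= high

print(f"SEM = {sem:.2f}, 95% band = [{low:.2f}, {high:.2f}], "
      f"cut score inside band: {cut_inside_band}")
```

Under these assumed values the cut score falls inside the band around the observed score, so a candidate scoring 6.0 cannot be confidently distinguished from one at 6.5. Whether that holds for a real test depends on its actual reliability and score distribution.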

In Practice

Test developers and users can address consequential validity by:

Monitoring washback: researching how the test affects teaching and learning
Analysing differential performance: checking whether the test functions differently across groups (e.g., gender, L1, socioeconomic status)
Reviewing score use: ensuring scores are used only for the purposes the test was designed for
Providing guidance: publishing clear guidelines on appropriate score interpretation and use
Revisiting decisions: reviewing high-stakes decisions regularly to check for unintended patterns
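
One common way to analyse differential performance on a single item is the Mantel-Haenszel procedure, which compares two groups' odds of answering correctly after matching candidates on total score. The following is a minimal sketch, not a full differential item functioning (DIF) analysis: the counts are invented, the function names are hypothetical, and it omits the significance test that normally accompanies the statistic.

```python
import math
from typing import List, Tuple

# Counts for one item within one total-score stratum:
# (reference_correct, reference_incorrect, focal_correct, focal_incorrect)
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_odds_ratio(strata: List[Stratum]) -> float:
    """Common odds ratio pooled across score strata."""
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        total = ref_right + ref_wrong + foc_right + foc_wrong
        if total == 0:
            continue  # skip empty strata
        numerator += ref_right * foc_wrong / total
        denominator += ref_wrong * foc_right / total
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_delta(odds_ratio: float) -> float:
    """Odds ratio on the delta scale: -2.35 * ln(odds ratio)."""
    return -2.35 * math.log(odds_ratio)


# Hypothetical data: three score strata (low, middle, high scorers).
hypothetical_strata: List[Stratum] = [
    (30, 20, 20, 30),
    (40, 10, 30, 20),
    (45, 5, 40, 10),
]

alpha = mantel_haenszel_odds_ratio(hypothetical_strata)
delta = mh_delta(alpha)
print(f"MH odds ratio = {alpha:.2f}, MH delta = {delta:.2f}")
```

An odds ratio above 1 (a negative delta) means that, among candidates with similar total scores, the reference group had higher odds of answering the item correctly. A result like this flags the item for content review; it does not by itself establish bias.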

Key References

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13–103). American Council on Education/Macmillan.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256.
McNamara, T., & Roever, C. (2006). Language Testing: The Social Dimension. Blackwell.