Construct Validity
Construct validity is the extent to which a test actually measures the theoretical ability (construct) it claims to measure. It is the central, unifying concept in modern validity theory — Messick (1989) argued that all validity is ultimately construct validity, with other "types" (content, criterion-related) providing different sources of evidence for the construct validity argument.
The Core Question
Does a test score mean what we say it means?
If a test is labelled "reading comprehension," construct validity asks: Do the scores genuinely reflect reading comprehension ability, or do they reflect something else — background knowledge, test-taking strategy, vocabulary size, or working memory capacity?
Two Primary Threats
Messick (1989) identified two threats that undermine construct validity:
Construct Underrepresentation (CU)
The test is too narrow — it fails to capture important aspects of the construct. Examples:
- A "writing ability" test that only requires sentence-level grammar manipulation (no extended composition, no coherence demands, no audience awareness)
- A "speaking" test where candidates read aloud from a script (no spontaneous production, no interaction)
- An IELTS Writing test that only used argumentative essays would underrepresent the writing construct — hence the inclusion of both Task 1 (data description) and Task 2 (argumentation)
Construct-Irrelevant Variance (CIV)
The test scores are influenced by factors outside the target construct. Two subtypes (a variance sketch follows the list):
- Construct-irrelevant difficulty — Features that make the test harder for reasons unrelated to the construct: a reading test whose culturally specific content presupposes background knowledge some candidates lack, or a writing test whose severe time pressure penalises slow but competent writers.
- Construct-irrelevant easiness — Features that inflate scores artificially. Predictable item formats that allow test-wise elimination strategies; answer options that can be identified without understanding the passage; templates that score well without genuine language ability.
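One way to picture CIV is a classical-test-theory-style decomposition of observed score variance. The explicit CIV term below is a conceptual sketch, not standard CTT notation, but it captures the key point: CIV is systematic variance, distinct from random error, so it cannot be averaged away simply by adding more items.

```latex
% Conceptual decomposition (illustrative, not standard CTT notation):
% observed variance = construct variance + systematic irrelevant variance + random error
\sigma^2_{X} \;=\; \sigma^2_{\text{construct}} \;+\; \sigma^2_{\text{CIV}} \;+\; \sigma^2_{E}
```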
How Construct Validity Differs from Content and Face Validity
| Aspect | Construct Validity | Content Validity | Face Validity |
|---|---|---|---|
| Focus | Does it measure the right ability? | Does it sample the domain adequately? | Does it look right? |
| Evidence | Statistical, theoretical, logical | Expert judgment, specification matching | Stakeholder impression |
| Status | The overarching validity concept | One source of evidence for construct validity | Not technically validity |
| Example | Factor analysis shows speaking scores load on one dimension, not separate grammar/fluency factors | The test covers all four language skills proportionally | Test-takers feel the speaking test is a fair test of speaking |
Content validity and face validity are contributory evidence for construct validity, not separate concepts at the same level. A test with strong content validity (good domain sampling) and strong face validity (stakeholder acceptance) has some evidence supporting construct validity — but not conclusive evidence.
Methods of Investigation
Correlation studies. If a test measures reading comprehension, scores should correlate with other established measures of reading comprehension (convergent validity) and correlate less strongly with measures of unrelated abilities (discriminant validity). The classic framework is Campbell & Fiske's (1959) multitrait-multimethod matrix.
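A minimal sketch of the convergent/discriminant logic on simulated data (all scores, variable names, and effect sizes here are invented for illustration; a real study would use operational test scores):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 200  # simulated candidates

# Simulate a latent reading ability and an unrelated latent trait
reading_ability = rng.normal(0, 1, n)
maths_ability = rng.normal(0, 1, n)

# Observed scores = latent trait + measurement noise
new_reading_test = reading_ability + rng.normal(0, 0.5, n)
established_reading_test = reading_ability + rng.normal(0, 0.5, n)
maths_test = maths_ability + rng.normal(0, 0.5, n)

# Convergent validity: correlation with another measure of the SAME construct (should be high)
r_conv, _ = pearsonr(new_reading_test, established_reading_test)
# Discriminant validity: correlation with a measure of a DIFFERENT construct (should be lower)
r_disc, _ = pearsonr(new_reading_test, maths_test)

print(f"convergent r = {r_conv:.2f}, discriminant r = {r_disc:.2f}")
```

In a full multitrait-multimethod design, several traits are each measured by several methods and the entire correlation matrix is inspected, not just two coefficients.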
Factor analysis. Statistical analysis of score patterns reveals whether the test measures one ability or several. If a "reading test" turns out to measure vocabulary knowledge and reading speed as separate factors, this informs the construct definition.
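A sketch of the exploratory approach using scikit-learn. The two-factor structure is deliberately built into the simulated data, so the analysis simply recovers it; the item set and loadings are hypothetical:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500

# Two latent abilities that the items might tap
vocab = rng.normal(size=n)
speed = rng.normal(size=n)

# Six "reading test" items: three driven by vocabulary, three by reading speed
loadings = np.array([
    [0.8, 0.1], [0.7, 0.2], [0.9, 0.0],   # vocabulary-driven items
    [0.1, 0.8], [0.0, 0.9], [0.2, 0.7],   # speed-driven items
])
items = np.column_stack([vocab, speed]) @ loadings.T + rng.normal(0, 0.4, (n, 6))

# Fit a two-factor model; varimax rotation makes the loadings easier to read
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
fa.fit(items)
print(np.round(fa.components_, 2))  # two distinct factors should emerge
```

In practice the number of factors is itself an empirical question, decided with fit statistics and substantive judgment rather than assumed in advance.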
Differential item functioning (DIF). Analysis of whether items function differently for subgroups (gender, L1, cultural background). If an item is significantly harder for one group than another at the same ability level, the item is tapping something beyond the target ability, a signal of construct-irrelevant variance.
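A sketch of one widely used technique, logistic-regression DIF detection. The data are simulated, and a latent ability variable stands in for the total-score matching criterion used in operational studies:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000

ability = rng.normal(size=n)           # matching criterion (total score in practice)
group = rng.integers(0, 2, n)          # 0/1 group membership (e.g. L1 background)

# Simulate one item that is harder for group 1 at equal ability (uniform DIF)
logit = 1.2 * ability - 0.6 * group
p = 1 / (1 + np.exp(-logit))
response = rng.binomial(1, p)          # 1 = item answered correctly

# DIF test: does group predict the item response after controlling for ability?
X = sm.add_constant(np.column_stack([ability, group]))
result = sm.Logit(response, X).fit(disp=0)
print(result.summary())  # x1 = ability; a significant x2 (group) flags uniform DIF
```

Adding an ability-by-group interaction term to the model would additionally test for non-uniform DIF, where the group effect changes across the ability range.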
Think-aloud protocols. Observing what test-takers actually do when answering items — are they using the target skill or something else? If "reading comprehension" items can be answered through vocabulary matching without actual comprehension, the construct is undermined.
Expert judgment. Subject matter experts evaluate whether test tasks require the target ability. This overlaps with content validation but focuses specifically on the cognitive processes demanded.
Why It Matters
Construct validity is not an abstract concern — it has direct consequences for teaching and learning:
- A test with weak construct validity misclassifies learners. Scores do not reflect actual ability, so decisions based on those scores (placement, certification, progression) rest on faulty information.
- Weak construct validity produces negative washback. If a reading test can be beaten through vocabulary memorisation, that is what teachers will teach — and learners will not develop reading ability.
- Every rating scale is an operationalisation of a construct. If the IELTS Writing descriptors emphasise features that do not actually distinguish good writing from poor writing, the construct has been poorly operationalised.
For test developers and teachers alike, the question is always: Am I measuring what I think I am measuring, or am I measuring something else?
Key References
- Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). Macmillan.
- Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford University Press.
- Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
- McNamara, T. (1996). Measuring Second Language Performance. Longman.
- Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105.
- Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge University Press.