Reliability
Reliability is the consistency of test results. A reliable test produces the same (or very similar) results under the same conditions — same test, same learners, same context. If a learner scores 7.0 on Monday and 5.5 on Wednesday with no change in ability, the test is unreliable.
Types of Reliability
Test-Retest Reliability
Same test, same people, different occasions. If results are consistent, the test is reliable over time. Practically difficult to measure because learners may remember items, improve between sittings, or be affected by different conditions.
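When it can be measured, the Pearson correlation between the two sittings serves as the test-retest coefficient. A minimal sketch in Python, with invented scores:

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical scores for the same learners on two sittings of the same test.
sitting_1 = [7.0, 5.5, 8.0, 6.5, 4.0, 7.5]
sitting_2 = [6.5, 5.0, 8.5, 6.0, 4.5, 7.0]

# Pearson r between occasions is the test-retest reliability coefficient:
# close to 1.0 means stable scores; close to 0 means mostly noise.
r = correlation(sitting_1, sitting_2)
print(f"Test-retest reliability: r = {r:.2f}")
```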
Inter-Rater Reliability
Different markers score the same performance. Critical for subjectively scored tasks (writing, speaking). If Rater A gives a 6 and Rater B gives a 4 for the same essay, inter-rater reliability is low. Solutions: detailed rubrics, rater training, standardization sessions, double marking.
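For band-scored work, two simple checks are exact agreement and adjacent (within-one-band) agreement; operational programs usually also report a chance-corrected coefficient such as Cohen's kappa. A minimal sketch with invented scores:

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical band scores (1-9 scale) from two raters on the same ten essays.
rater_a = [6, 5, 7, 4, 8, 6, 5, 7, 6, 4]
rater_b = [6, 4, 7, 5, 8, 5, 5, 6, 6, 4]

pairs = list(zip(rater_a, rater_b))
exact = sum(a == b for a, b in pairs) / len(pairs)             # identical band
adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs) # within one band

print(f"Exact agreement:    {exact:.0%}")
print(f"Adjacent agreement: {adjacent:.0%}")
print(f"Correlation:        r = {correlation(rater_a, rater_b):.2f}")
```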
Intra-Rater Reliability
The same marker scores the same performance consistently over time. A marker who gives harsher scores at the end of a long marking session shows low intra-rater reliability (rater fatigue). Solutions: breaks, re-marking samples, benchmark scripts.
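One rough screen for fatigue drift is to compare scores from early and late in a session, in marking order. A minimal sketch with invented data; note that genuinely weaker scripts late in the pile would produce the same pattern, so re-marking a sample is the proper check:

```python
from statistics import mean

# Hypothetical essay scores in the order they were marked during one session.
scores_in_order = [7, 6, 7, 8, 6, 7, 6, 5, 5, 6, 5, 4]

half = len(scores_in_order) // 2
early, late = scores_in_order[:half], scores_in_order[half:]

# A systematic drop late in the session suggests rater fatigue.
drift = mean(late) - mean(early)
print(f"Early mean: {mean(early):.2f}, late mean: {mean(late):.2f}")
print(f"Drift: {drift:+.2f} bands")  # negative = harsher over time
```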
Internal Consistency
Do the items on the test measure the same construct? If items are supposed to test reading ability, they should correlate with each other. Measured statistically (Cronbach's alpha, KR-20). Low internal consistency suggests the test is measuring multiple unrelated things.
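Cronbach's alpha is straightforward to compute by hand: alpha = k/(k-1) × (1 − Σ item variances / total-score variance), where k is the number of items. With dichotomous (0/1) items it reduces to KR-20. A minimal sketch with invented data:

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha from per-learner rows of item scores.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])                                  # number of items
    items = list(zip(*rows))                          # one tuple per item
    item_var_sum = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical data: each row is one learner, each column one 0/1 item.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"Cronbach's alpha = {cronbach_alpha(data):.2f}")  # 0.80 for this data
```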
Factors Affecting Reliability
| Factor | Effect | Solution |
|---|---|---|
| Too few items | Small sample of behavior = more random variation | Increase number of items/tasks |
| Ambiguous items | Different interpretations = inconsistent responses | Pilot and revise items |
| Subjective scoring | Marker variability | Rubrics, training, standardization |
| Test conditions | Noise, timing, unclear instructions | Standardize administration |
| Rater fatigue | Scores drift over time | Breaks, benchmarking, moderation |
| Limited range of difficulty | All items too easy or too hard = scores cluster | Include items across difficulty levels |
Reliability vs Validity
Reliability is necessary but not sufficient for validity. A broken thermometer that always reads 38°C is perfectly reliable but completely invalid as a measure of actual temperature. In testing terms: a grammar multiple-choice test may produce very consistent scores (high reliability) but be invalid as a measure of communicative ability.
The tension often runs the other way in practice: highly valid tasks (extended writing, free conversation) tend to be less reliable because they introduce more scorer variability. This is the fundamental trade-off in assessment design. The solution is not to avoid valid task types but to invest in the scoring infrastructure (rubrics, training, moderation) that makes them reliable.
Practical Implications
- For classroom teachers: Perfect reliability is not necessary for low-stakes assessments, but systematic inconsistencies (e.g., grading more harshly when tired) should be avoided. Use rubrics even for informal assessments.
- For high-stakes tests: Reliability must be formally measured and reported. Double marking, statistical analysis, and standardization are essential.
- For test design: More items and tasks = higher reliability. A 10-item quiz is inherently less reliable than a 50-item test (the Spearman-Brown sketch below quantifies the gain). But more items mean less practicality, so the trade-off must be managed.
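The item-count trade-off can be quantified with the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened with items of comparable quality. A minimal sketch, assuming a starting reliability of 0.60 for the 10-item quiz:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after lengthening a test by `length_factor`
    using items of comparable quality (Spearman-Brown prophecy formula)."""
    n, r = length_factor, reliability
    return n * r / (1 + (n - 1) * r)

# A 10-item quiz with assumed reliability 0.60, extended to 50 items (n = 5):
print(f"Predicted reliability: {spearman_brown(0.60, 5):.2f}")  # ~0.88
```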