Reliability
Reliability is the consistency of test results. A reliable test produces the same (or very similar) results under the same conditions — same test, same learners, same context. If a learner scores 7.0 on Monday and 5.5 on Wednesday with no change in ability, the test is unreliable.
Types of Reliability
Test-Retest Reliability
Same test, same people, different occasions. If results are consistent, the test is reliable over time. Practically difficult to measure because learners may remember items, improve between sittings, or be affected by different conditions.
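When it can be measured, the Pearson correlation between the two sittings serves as the test-retest coefficient. A minimal sketch in Python, with invented scores:

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical scores for the same learners on two sittings of the same test.
sitting_1 = [7.0, 5.5, 8.0, 6.5, 4.0, 7.5]
sitting_2 = [6.5, 5.0, 8.5, 6.0, 4.5, 7.0]

# Pearson r between occasions is the test-retest reliability coefficient:
# close to 1.0 means stable scores; close to 0 means mostly noise.
r = correlation(sitting_1, sitting_2)
print(f"Test-retest reliability: r = {r:.2f}")
```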
Inter-Rater Reliability
Different markers score the same performance. Critical for subjectively scored tasks (writing, speaking). If Rater A gives a 6 and Rater B gives a 4 for the same essay, inter-rater reliability is low. Solutions: detailed rubrics, rater training, standardization sessions, double marking.
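For band-scored work, two simple checks are exact agreement and adjacent (within-one-band) agreement; operational programs usually also report a chance-corrected coefficient such as Cohen's kappa. A minimal sketch with invented scores:

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical band scores (1-9 scale) from two raters on the same ten essays.
rater_a = [6, 5, 7, 4, 8, 6, 5, 7, 6, 4]
rater_b = [6, 4, 7, 5, 8, 5, 5, 6, 6, 4]

pairs = list(zip(rater_a, rater_b))
exact = sum(a == b for a, b in pairs) / len(pairs)             # identical band
adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs) # within one band

print(f"Exact agreement:    {exact:.0%}")
print(f"Adjacent agreement: {adjacent:.0%}")
print(f"Correlation:        r = {correlation(rater_a, rater_b):.2f}")
```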
Intra-Rater Reliability
The same marker scores the same performance consistently over time. A marker who gives harsher scores at the end of a long marking session shows low intra-rater reliability (rater fatigue). Solutions: breaks, re-marking samples, benchmark scripts.
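One rough screen for fatigue drift is to compare scores from early and late in a session, in marking order. A minimal sketch with invented data; note that genuinely weaker scripts late in the pile would produce the same pattern, so re-marking a sample is the proper check:

```python
from statistics import mean

# Hypothetical essay scores in the order they were marked during one session.
scores_in_order = [7, 6, 7, 8, 6, 7, 6, 5, 5, 6, 5, 4]

half = len(scores_in_order) // 2
early, late = scores_in_order[:half], scores_in_order[half:]

# A systematic drop late in the session suggests rater fatigue.
drift = mean(late) - mean(early)
print(f"Early mean: {mean(early):.2f}, late mean: {mean(late):.2f}")
print(f"Drift: {drift:+.2f} bands")  # negative = harsher over time
```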
Internal Consistency
Do the items on the test measure the same construct? If items are supposed to test reading ability, they should correlate with each other. Measured statistically (Cronbach's alpha, KR-20). Low internal consistency suggests the test is measuring multiple unrelated things.
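Cronbach's alpha is straightforward to compute by hand: alpha = k/(k-1) × (1 − Σ item variances / total-score variance), where k is the number of items. With dichotomous (0/1) items it reduces to KR-20. A minimal sketch with invented data:

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha from per-learner rows of item scores.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])                                  # number of items
    items = list(zip(*rows))                          # one tuple per item
    item_var_sum = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical data: each row is one learner, each column one 0/1 item.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"Cronbach's alpha = {cronbach_alpha(data):.2f}")  # 0.80 for this data
```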
Factors Affecting Reliability
| Factor | Effect | Solution |
|---|---|---|
| Too few items | Small sample of behavior = more random variation | Increase number of items/tasks |
| Ambiguous items | Different interpretations = inconsistent responses | Pilot and revise items |
| Subjective scoring | Marker variability | Rubrics, training, standardization |
| Test conditions | Noise, timing, unclear instructions | Standardize administration |
| Rater fatigue | Scores drift over time | Breaks, benchmarking, moderation |
| Limited range of difficulty | All items too easy or too hard = scores cluster | Include items across difficulty levels |
Reliability vs Validity
Reliability is necessary but not sufficient for validity. A broken thermometer that always reads 38°C is perfectly reliable but completely invalid as a measure of actual temperature. In testing terms: a grammar multiple-choice test may produce very consistent scores (high reliability) but be invalid as a measure of communicative ability.
The tension often runs the other way in practice: highly valid tasks (extended writing, free conversation) tend to be less reliable because they introduce more scorer variability. This is the fundamental trade-off in assessment design. The solution is not to avoid valid task types but to invest in the scoring infrastructure (rubrics, training, moderation) that makes them reliable.
Practical Implications
- For classroom teachers: Perfect reliability is not necessary for low-stakes assessments, but systematic inconsistencies (e.g., grading more harshly when tired) should be avoided. Use rubrics even for informal assessments.
- For high-stakes tests: Reliability must be formally measured and reported. Double marking, statistical analysis, and standardization are essential.
- For test design: More items and tasks = higher reliability. A 10-item quiz is inherently less reliable than a 50-item test (the Spearman-Brown sketch below quantifies the gain). But more items mean less practicality, so the trade-off must be managed.
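The item-count trade-off can be quantified with the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened with items of comparable quality. A minimal sketch, assuming a starting reliability of 0.60 for the 10-item quiz:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after lengthening a test by `length_factor`
    using items of comparable quality (Spearman-Brown prophecy formula)."""
    n, r = length_factor, reliability
    return n * r / (1 + (n - 1) * r)

# A 10-item quiz with assumed reliability 0.60, extended to 50 items (n = 5):
print(f"Predicted reliability: {spearman_brown(0.60, 5):.2f}")  # ~0.88
```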