Performance Assessment
Performance assessment requires learners to do something — speak, write, demonstrate, create — rather than select from pre-determined options. The response is constructed, not chosen, and is evaluated against criteria defined in a rubric or rating scale.
In language testing, performance assessment is most obviously relevant to productive skills (speaking and writing), but it extends to any task where the learner must generate language to accomplish a communicative purpose: summarising a text, participating in a group discussion, giving instructions, writing a formal complaint.
Characteristics
| Feature | Description |
|---|---|
| Constructed response | Learners produce language rather than recognise correct answers |
| Criteria-based scoring | Judged by trained raters using band descriptors, not answer keys |
| Extended performance | Tasks are longer and more complex than discrete-point items |
| Skill integration | Often involves reading-into-writing, listening-into-speaking |
| Direct measurement | Tests the skill itself, not a proxy (e.g., testing writing by having learners write) |
Direct vs Indirect Testing
Performance assessment embodies direct testing — assessing the target skill through actual performance of that skill. This contrasts with indirect testing, where a proxy task is used:
| Direct (Performance) | Indirect |
|---|---|
| Writing an essay → scored for writing ability | Editing sentences → inferred writing ability |
| Giving a presentation → scored for speaking | Pronunciation discrimination task → inferred speaking |
| Listening to a lecture and taking notes → scored for note quality | Multiple-choice comprehension items → inferred listening |
Direct testing has stronger face validity and construct validity for communicative ability, but is more expensive and harder to score reliably.
Scoring Performance
Performance assessment requires subjective judgment, which introduces variability. Managing this requires:
- Clear rubrics with specific band descriptors
- Rater training with benchmark samples at each level
- Inter-rater reliability checks — double-marking, statistical monitoring
- Analytic scoring when diagnostic information is needed; holistic scoring when efficiency matters
The scoring method directly affects washback. Analytic rubrics signal to teachers and learners which specific aspects of performance matter; holistic rubrics encourage attention to overall communicative effectiveness.
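The inter-rater reliability monitoring mentioned above can be sketched in a few lines of code. This is an illustrative example only, not part of any standard testing toolkit: the band scores are invented, and Cohen's kappa is just one common agreement statistic (operational programmes often use weighted kappa or many-facet Rasch analysis instead).

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected exact agreement between two raters
    who have double-marked the same set of scripts."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    # Agreement expected by chance, from each rater's marginal distribution
    expected = sum(counts_a[band] * counts_b[band] for band in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def adjacent_agreement(ratings_a, ratings_b, tolerance=1):
    """Proportion of scripts where the two raters differ by at most
    `tolerance` bands (a common operational monitoring threshold)."""
    return sum(abs(a - b) <= tolerance
               for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)

# Hypothetical band scores (1-6 scale) from two raters double-marking ten scripts
rater_1 = [4, 5, 3, 6, 4, 2, 5, 4, 3, 5]
rater_2 = [4, 4, 3, 6, 5, 2, 5, 3, 3, 5]

print(f"Cohen's kappa:      {cohens_kappa(rater_1, rater_2):.2f}")
print(f"Adjacent agreement: {adjacent_agreement(rater_1, rater_2):.0%}")
```

In practice a centre would flag rater pairs whose kappa or adjacent-agreement rate falls below a preset threshold and route those scripts to a third marker.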
Task Design
Effective performance tasks:
- Have a clear communicative purpose (not just "write 250 words")
- Specify the audience, context, and expected output
- Are accessible to the target population — task difficulty should come from the language demands, not from unfamiliar content
- Sample the construct adequately — a single writing task cannot represent all of writing ability
- Allow for a range of performance levels — the task should be completable at lower levels but allow stronger candidates to demonstrate higher ability
Limitations
- Resource-intensive — Requires trained raters, standardisation meetings, and more administration time
- Reliability — Inherently lower than objective testing unless rater training and monitoring are rigorous
- Generalisability — Performance on one task may not predict performance on a different task; multiple tasks improve generalisability but increase cost
- Task effects — Topic familiarity, task type, and interlocutor behaviour all affect performance independently of language ability
Key References
- McNamara, T. (1996). Measuring Second Language Performance. Longman.
- Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
- Weigle, S. C. (2002). Assessing Writing. Cambridge University Press.
- Fulcher, G. (2003). Testing Second Language Speaking. Pearson.
See Also
- Authentic Assessment — performance assessment is inherently more authentic than selected-response testing
- Rubric — the scoring instrument for performance assessment
- Rating Scale — the structure that organises scoring criteria
- Productive Skills — the skills most commonly assessed through performance
- Inter-rater Reliability — the key reliability concern for performance assessment