Performance Assessment
Performance assessment requires learners to do something — speak, write, demonstrate, create — rather than select from pre-determined options. The response is constructed, not chosen, and is evaluated against criteria defined in a rubric or rating scale.
In language testing, performance assessment is most obviously relevant to productive skills (speaking and writing), but it extends to any task where the learner must generate language to accomplish a communicative purpose: summarising a text, participating in a group discussion, giving instructions, writing a formal complaint.
Characteristics
| Feature | Description |
|---|---|
| Constructed response | Learners produce language rather than recognise correct answers |
| Criteria-based scoring | Judged by trained raters using band descriptors, not answer keys |
| Extended performance | Tasks are longer and more complex than discrete-point items |
| Skill integration | Often involves reading-into-writing, listening-into-speaking |
| Direct measurement | Tests the skill itself, not a proxy (e.g., testing writing by having learners write) |
Direct vs Indirect Testing
Performance assessment embodies direct testing — assessing the target skill through actual performance of that skill. This contrasts with indirect testing, where a proxy task is used:
| Direct (Performance) | Indirect |
|---|---|
| Writing an essay → scored for writing ability | Editing sentences → inferred writing ability |
| Giving a presentation → scored for speaking | Pronunciation discrimination task → inferred speaking |
| Listening to a lecture and taking notes → scored for note quality | Multiple-choice comprehension items → inferred listening |
Direct testing has stronger face validity and construct validity for communicative ability, but is more expensive and harder to score reliably.
Scoring Performance
Performance assessment requires subjective judgment, which introduces variability. Managing this requires:
- Clear rubrics with specific band descriptors
- Rater training with benchmark samples at each level
- Inter-rater reliability checks — double-marking, statistical monitoring
- Analytic scoring when diagnostic information is needed; holistic scoring when efficiency matters
The scoring method directly affects washback. Analytic rubrics signal to teachers and learners which specific aspects of performance matter; holistic rubrics encourage attention to overall communicative effectiveness.
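The inter-rater reliability monitoring mentioned above can be sketched in a few lines of code. This is an illustrative example only, not part of any standard testing toolkit: the band scores are invented, and Cohen's kappa is just one common agreement statistic (operational programmes often use weighted kappa or many-facet Rasch analysis instead).

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected exact agreement between two raters
    who have double-marked the same set of scripts."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    # Agreement expected by chance, from each rater's marginal distribution
    expected = sum(counts_a[band] * counts_b[band] for band in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def adjacent_agreement(ratings_a, ratings_b, tolerance=1):
    """Proportion of scripts where the two raters differ by at most
    `tolerance` bands (a common operational monitoring threshold)."""
    return sum(abs(a - b) <= tolerance
               for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)

# Hypothetical band scores (1-6 scale) from two raters double-marking ten scripts
rater_1 = [4, 5, 3, 6, 4, 2, 5, 4, 3, 5]
rater_2 = [4, 4, 3, 6, 5, 2, 5, 3, 3, 5]

print(f"Cohen's kappa:      {cohens_kappa(rater_1, rater_2):.2f}")
print(f"Adjacent agreement: {adjacent_agreement(rater_1, rater_2):.0%}")
```

In practice a centre would flag rater pairs whose kappa or adjacent-agreement rate falls below a preset threshold and route those scripts to a third marker.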
Task Design
Effective performance tasks:
- Have a clear communicative purpose (not just "write 250 words")
- Specify the audience, context, and expected output
- Are accessible to the target population — task difficulty should come from the language demands, not from unfamiliar content
- Sample the construct adequately — a single writing task cannot represent all of writing ability
- Allow for a range of performance levels — the task should be completable at lower levels but allow stronger candidates to demonstrate higher ability
Limitations
- Resource-intensive — Requires trained raters, standardisation meetings, and more administration time
- Reliability — Inherently lower than objective testing unless rater training and monitoring are rigorous
- Generalisability — Performance on one task may not predict performance on a different task; multiple tasks improve generalisability but increase cost
- Task effects — Topic familiarity, task type, and interlocutor behaviour all affect performance independently of language ability
Key References
- McNamara, T. (1996). Measuring Second Language Performance. Longman.
- Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford University Press.
- Weigle, S. C. (2002). Assessing Writing. Cambridge University Press.
- Fulcher, G. (2003). Testing Second Language Speaking. Pearson.
See Also
- Authentic Assessment — performance assessment is inherently more authentic than selected-response testing
- Rubric — the scoring instrument for performance assessment
- Rating Scale — the structure that organises scoring criteria
- Productive Skills — the skills most commonly assessed through performance
- Inter-rater Reliability — the key reliability concern for performance assessment