ELTiverse

Performance Assessment

Performance assessment requires learners to do something — speak, write, demonstrate, create — rather than select from pre-determined options. The response is constructed, not chosen, and is evaluated against criteria defined in a rubric or rating scale.

In language testing, performance assessment is most obviously relevant to productive skills (speaking and writing), but it extends to any task where the learner must generate language to accomplish a communicative purpose: summarising a text, participating in a group discussion, giving instructions, writing a formal complaint.

Characteristics

  • Constructed response — Learners produce language rather than recognise correct answers
  • Criteria-based scoring — Judged by trained raters using band descriptors, not answer keys
  • Extended performance — Tasks are longer and more complex than discrete-point items
  • Skill integration — Often involves reading-into-writing or listening-into-speaking
  • Direct measurement — Tests the skill itself, not a proxy (e.g., testing writing by having learners write)

Direct vs Indirect Testing

Performance assessment embodies direct testing — assessing the target skill through actual performance of that skill. This contrasts with indirect testing, where a proxy task is used:

  • Direct: writing an essay, scored for writing ability. Indirect: editing sentences, from which writing ability is inferred.
  • Direct: giving a presentation, scored for speaking. Indirect: a pronunciation discrimination task, from which speaking ability is inferred.
  • Direct: listening to a lecture and taking notes, scored for note quality. Indirect: a multiple-choice listening comprehension test.

Direct testing has stronger face validity and construct validity for communicative ability, but is more expensive and harder to score reliably.

Scoring Performance

Performance assessment requires subjective judgment, which introduces variability. Managing this requires:

  1. Clear rubrics with specific band descriptors
  2. Rater training with benchmark samples at each level
  3. Inter-rater reliability checks — double-marking, statistical monitoring
  4. Analytic scoring when diagnostic information is needed; holistic scoring when efficiency matters
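Point 3 above can be made concrete. The sketch below computes three common double-marking statistics — exact agreement, adjacent agreement (within one band), and Cohen's kappa — for two raters' band scores. The scores themselves are invented for illustration; real monitoring would run over a sampled set of scripts.

```python
from collections import Counter

def agreement_stats(rater_a, rater_b):
    """Exact agreement, adjacent agreement (within one band), and Cohen's kappa."""
    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n
    # Cohen's kappa corrects observed agreement for agreement expected by chance,
    # estimated from each rater's marginal distribution of bands.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    kappa = (exact - pe) / (1 - pe) if pe < 1 else 1.0
    return exact, adjacent, kappa

# Hypothetical band scores (1-9 scale) from double-marking ten scripts
a = [5, 6, 7, 5, 4, 6, 8, 5, 6, 7]
b = [5, 6, 6, 5, 5, 6, 8, 4, 6, 7]
exact, adjacent, kappa = agreement_stats(a, b)  # 0.7, 1.0, 0.6
```

Thresholds for acceptable agreement vary by programme; the point is that rater consistency is monitored with data, not assumed.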

The scoring method directly affects washback. Analytic rubrics signal to teachers and learners which specific aspects of performance matter; holistic rubrics encourage attention to overall communicative effectiveness.
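To illustrate the analytic option, here is a minimal sketch of a weighted analytic rubric. The four criteria and their equal weights are hypothetical, loosely modelled on common four-criterion writing scales, not taken from any particular exam.

```python
# Hypothetical analytic rubric: four criteria with equal weights
RUBRIC_WEIGHTS = {
    "task_achievement": 0.25,
    "coherence_cohesion": 0.25,
    "lexical_resource": 0.25,
    "grammatical_range": 0.25,
}

def composite_score(criterion_scores):
    """Weighted composite of analytic band scores, rounded to the nearest half band."""
    raw = sum(RUBRIC_WEIGHTS[c] * s for c, s in criterion_scores.items())
    return round(raw * 2) / 2

scores = {"task_achievement": 6, "coherence_cohesion": 6,
          "lexical_resource": 5, "grammatical_range": 5}
composite_score(scores)  # 5.5
```

The criterion profile (here, weaker lexis and grammar) is exactly the diagnostic information an analytic scheme provides; a holistic scheme would report only the single band.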

Task Design

Effective performance tasks:

  • Have a clear communicative purpose (not just "write 250 words")
  • Specify the audience, context, and expected output
  • Are accessible to the target population — task difficulty should come from the language demands, not from unfamiliar content
  • Sample the construct adequately — a single writing task cannot represent all of writing ability
  • Allow for a range of performance levels — the task should be completable at lower levels but allow stronger candidates to demonstrate higher ability
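One way to operationalise this checklist is to record each task as a structured specification, so purpose, audience, context, and expected output are stated explicitly before the task is administered. A minimal sketch follows; the field names and the sample task are hypothetical, not a standard format.

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """Hypothetical performance-task specification mirroring the checklist above."""
    purpose: str          # the communicative purpose, not just a word count
    audience: str         # who the response is addressed to
    context: str          # situation the candidate is placed in
    expected_output: str  # genre and length of the expected response
    target_level: str     # intended proficiency range

complaint_task = TaskSpec(
    purpose="Request a refund for a faulty product",
    audience="Customer services manager",
    context="An online purchase arrived damaged",
    expected_output="Formal complaint email, roughly 150-200 words",
    target_level="B1-B2",
)
```

Writing the specification down makes it easier to check that difficulty comes from language demands rather than unfamiliar content, and that a battery of tasks samples the construct adequately.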

Limitations

  • Resource-intensive — Requires trained raters, standardisation meetings, and more administration time
  • Reliability — Inherently lower than objective testing unless rater training and monitoring are rigorous
  • Generalisability — Performance on one task may not predict performance on a different task; multiple tasks improve generalisability but increase cost
  • Task effects — Topic familiarity, task type, and interlocutor behaviour all affect performance independently of language ability
