Type I and Type II Error

Research Methodology, alpha error, beta error, false positive, false negative, statistical power

The two ways a statistical test can reach a wrong decision under the Neyman–Pearson framework. Type I error is the false rejection of a true Null Hypothesis — concluding an effect exists when none does. Type II error is the failure to reject a false null — missing a real effect. Their probabilities are conventionally written α and β.

Definitions

α is the probability of a Type I error, that is, of rejecting H₀ when it is in fact true. It is set in advance as the significance level, typically .05 in applied linguistics, and serves as the threshold for the p-value: H₀ is rejected when p falls below α. β is the probability of a Type II error; it depends on the true Effect Size, the sample size, α, and the variability of the measure. Statistical power is 1 − β, the probability of detecting a real effect when one exists.
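The definitions can be checked empirically with a small simulation. The sketch below is illustrative only: it assumes two groups drawn from normal populations with a known SD of 1 (so a z-test applies), and the function names, the seed, the group size of 30 and the true effect of d = 0.5 are all hypothetical choices, not values from the sources cited here.

```python
import random
from statistics import NormalDist, fmean

def two_group_z_test(x, y, sd=1.0):
    """Two-sided p-value for a difference in means, with the SD assumed known."""
    se = sd * (1 / len(x) + 1 / len(y)) ** 0.5
    z = (fmean(x) - fmean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_d, n_per_group, alpha=0.05, n_sims=4000, seed=1):
    """Share of simulated studies in which H0 (no difference) is rejected."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if two_group_z_test(x, y) < alpha:
            rejections += 1
    return rejections / n_sims

# H0 true (d = 0): every rejection is a Type I error, so the rate estimates alpha.
type_i_rate = rejection_rate(true_d=0.0, n_per_group=30)
# H0 false (d = 0.5): the rejection rate estimates power, and 1 - power estimates beta.
power = rejection_rate(true_d=0.5, n_per_group=30)

print(f"Type I rate ~ {type_i_rate:.3f}")
print(f"power ~ {power:.3f}, beta ~ {1 - power:.3f}")
```

Under these assumptions the first rate should land close to the chosen α, while the second depends on the effect size and group size passed in.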

The Trade-off

For a fixed sample size and effect size, lowering α (making the test more conservative) raises β. Increasing the sample size lowers β at any given α and is the most direct way to reduce both error rates at once; reducing the variability of the measure has a similar effect. Cohen (1988) recommends a power of .80 as a working minimum, which corresponds to a 20% Type II error rate, and provides tables linking α, expected effect size, and required N for the standard tests.
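The trade-off can be made concrete with a rough calculation. The following is a minimal sketch using the normal approximation for a two-sided comparison of two equal-sized groups; it is not Cohen's tabled procedure, and because it ignores the t distribution its sample sizes come out slightly below those in t-based tables. The function names and the example values (d = 0.5, 30 and 64 per group) are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected z statistic under the true effect
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)

def n_per_group_for(d, power=0.80, alpha=0.05):
    """Per-group n from the usual closed-form normal approximation."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)

# Fixed n: a stricter alpha raises beta.
beta_05 = 1 - power_two_groups(0.5, 30, alpha=0.05)
beta_01 = 1 - power_two_groups(0.5, 30, alpha=0.01)
# Fixed alpha: a larger sample lowers beta.
beta_05_n64 = 1 - power_two_groups(0.5, 64, alpha=0.05)

print(f"n=30, alpha=.05: beta ~ {beta_05:.2f}")
print(f"n=30, alpha=.01: beta ~ {beta_01:.2f}")
print(f"n=64, alpha=.05: beta ~ {beta_05_n64:.2f}")
print("approx. n per group for power .80 at d=0.5:", n_per_group_for(0.5))
```

For planning a real study, a dedicated power-analysis tool based on the noncentral t distribution is the safer choice; the approximation here is only meant to show the direction of each effect.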

Power in SLA Research

Empirical reviews of L2 research repeatedly find that typical published studies are underpowered against the medium-sized effects they claim to detect. Plonsky and Oswald (2014), summarising the L2 literature, note that small classroom samples (often N < 30 per group) give power well below .50 against effects of d = 0.4. The practical consequence is a literature in which true positives are mixed with both Type I errors (chance findings published because their p-values fell below α) and Type II errors (real effects that did not reach significance and went unpublished). Replication and meta-analysis are the corrective.
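How the two error types mix in a body of studies can be sketched with simple arithmetic. The example below is hypothetical throughout: the share of tested hypotheses for which the effect is real (0.5) is an arbitrary assumption, not an estimate from the L2 literature, and the power of .34 is roughly what a normal approximation gives for 30 per group and d = 0.4. The function name and dictionary keys are likewise invented for illustration.

```python
def literature_mix(share_real, power, alpha=0.05):
    """Expected outcome shares across many independent studies.

    share_real is a HYPOTHETICAL proportion of tested hypotheses
    for which the effect genuinely exists.
    """
    return {
        "true_positive": share_real * power,               # real effect, detected
        "type_ii": share_real * (1 - power),               # real effect, missed
        "type_i": (1 - share_real) * alpha,                # no effect, "found" by chance
        "true_negative": (1 - share_real) * (1 - alpha),   # no effect, none claimed
    }

mix = literature_mix(share_real=0.5, power=0.34)

# Among significant results only, what share are Type I errors?
significant = mix["type_i"] + mix["true_positive"]
false_share_of_significant = mix["type_i"] / significant

for outcome, share in mix.items():
    print(f"{outcome:>14}: {share:.3f}")
print(f"Type I share of significant results: {false_share_of_significant:.3f}")
```

Changing `share_real` or `power` shows how quickly the balance shifts: with low power, most real effects end up in the Type II row, and chance findings make up a larger part of what reaches significance.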

Decision Costs

Type I and Type II errors carry different costs depending on context. In language-testing research, a Type I error on a fairness analysis (concluding a test is biased when it is not) may invalidate a sound assessment; a Type II error (missing real bias) lets unfair decisions persist. Naming which error is worse for the question in hand is part of any defensible α and power choice.

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Plonsky, L., & Oswald, F. L. (2014). How big is "big"? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912.
Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). London: Sage.
Larson-Hall, J. (2016). A Guide to Doing Statistics in Second Language Research Using SPSS and R (2nd ed.). New York: Routledge.

Related Terms