CEFR Text-Level Classification


The task of mapping a piece of running text to a single CEFR band (A1, A2, B1, B2, C1, C2). Distinct from vocabulary profiling, which tags individual lexical items by level. A lexical profile tells you what proportion of a text's vocabulary sits at each CEFR band; a CEFR text-level classifier tells you what band the text as a whole belongs in.

The two tasks are routinely confused, including by tool vendors. They draw on overlapping evidence (vocabulary distribution is typically the strongest single feature in a CEFR text classifier) but answer different questions and require different validation. A text judged B2 overall can be lexically C1 and grammatically B1, or the reverse. The level of the text is a holistic judgement; the level of the words is a distribution.

Why a wordlist alone cannot classify text

Vocabulary inventories (CEFR-J, the English Vocabulary Profile, CEFR-J + Octanove) are resources, not classifiers. They map word forms to CEFR labels. To convert that mapping into a single text-level call, a tool must add a heuristic on top: "the level at which 95% of tokens are covered," "the modal level of content words," "the level above which fewer than 5% of tokens fall." Every such heuristic is a defensible choice, but none of them is determined by the wordlist itself. CVLA, the official CEFR-J text-level analyser, uses a six-feature stable mean over CEFR-J wordlist coverage. Other implementations apply percentile or modal rules. The choice is honest only when the tool documents it.
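The gap between a wordlist and a text-level call can be made concrete with a minimal sketch. The wordlist below is a hypothetical toy stand-in, not CEFR-J or EVP data, and the function names and the 0.95 threshold are illustrative assumptions. For simplicity the modal rule here counts all listed tokens rather than content words only.

```python
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical toy wordlist; a real tool would load CEFR-J or EVP entries here.
TOY_WORDLIST = {
    "the": "A1", "cat": "A1", "sat": "A1", "on": "A1", "a": "A1",
    "mat": "A2", "quietly": "B1", "despite": "B2", "ubiquitous": "C1",
}


def level_counts(tokens, wordlist):
    """Count tokens per CEFR level, ignoring tokens missing from the wordlist."""
    return Counter(wordlist[t] for t in tokens if t in wordlist)


def coverage_level(tokens, wordlist, threshold=0.95):
    """Lowest level at which cumulative coverage of listed tokens reaches threshold."""
    counts = level_counts(tokens, wordlist)
    total = sum(counts.values())
    if total == 0:
        return None  # nothing in the text is on the wordlist
    covered = 0
    for level in LEVELS:
        covered += counts[level]
        if covered / total >= threshold:
            return level
    return LEVELS[-1]


def modal_level(tokens, wordlist):
    """Most frequent level among listed tokens; ties resolve to the lower level."""
    counts = level_counts(tokens, wordlist)
    if not counts:
        return None
    return max(LEVELS, key=lambda level: counts[level])


tokens = "the cat sat quietly on a mat despite the ubiquitous cat".split()
print(coverage_level(tokens, TOY_WORDLIST))  # C1
print(modal_level(tokens, TOY_WORDLIST))     # A1
```

On this toy input the coverage rule returns C1 and the modal rule returns A1. Both rules are defensible readings of the same wordlist lookup, which is why a tool needs to document which one it applies.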

Trained classifiers go further. Instead of a fixed rule, they learn from CEFR-labelled texts which combination of features (vocabulary distribution, sentence length, syntactic depth, discourse markers, error rates) best predicts human-assigned levels. The output is still calibrated to one specific corpus and one specific labeller pool, but the calibration is empirical rather than imposed.

The free, open landscape

Three open systems span the practical accuracy range, each with a different tradeoff.

Feature-based XGBoost. Adam Montgomerie's CEFR-English-Level-Predictor uses engineered features (SMOG, Dale-Chall, FKGL, spaCy parse depth, POS distributions) on a small CEFR-labelled corpus. Reports 70.9% exact accuracy and 95% within-one-level on its test set. Document-level, fully open-source, no GPU needed, sub-second inference. The natural integration target for any tool that already extracts these features.
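Exact accuracy and within-one-level accuracy are different measurements of the same predictions. The sketch below shows how each could be computed; it is a generic illustration with made-up labels, not Montgomerie's evaluation code, and the function name is hypothetical.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
INDEX = {level: i for i, level in enumerate(LEVELS)}


def exact_and_within_one(gold, predicted):
    """Return (exact accuracy, within-one-level accuracy) for paired labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    # Distance in bands between the human label and the predicted label.
    distances = [abs(INDEX[g] - INDEX[p]) for g, p in zip(gold, predicted)]
    exact = sum(d == 0 for d in distances) / len(distances)
    within_one = sum(d <= 1 for d in distances) / len(distances)
    return exact, within_one


# Made-up labels for illustration only.
gold = ["A1", "A2", "B1", "B2", "C1"]
predicted = ["A1", "B1", "B1", "C2", "C1"]
print(exact_and_within_one(gold, predicted))  # (0.6, 0.8)
```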

Fine-tuned transformer. UniversalCEFR (EMNLP 2025) trained xlm-roberta-base and ModernBERT-base classifiers on a 505K-text multilingual CEFR corpus (13 languages, English the largest slice). Higher ceiling than feature-based methods on long texts; per-text inference time in the seconds range on CPU. Constituent training data carries mixed licences; the released classifier weights themselves are usable but should be cited per the EMNLP 2025 paper.

Sentence-level transformer. Yuki Arase, Satoru Uchida, and Tomoyuki Kajiwara's CEFR-SP (EMNLP 2022) trained BERT on 17K professionally annotated English sentences. It reports 84.5% macro-F1, the best published English CEFR figure, but it operates at the sentence level. To produce a passage-level call, sentence results must be aggregated, and the aggregation policy is itself a design decision.
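How much the aggregation policy matters can be seen by applying several policies to the same sentence-level predictions. The three policies below (rounded mean, maximum, 80th percentile) are illustrative assumptions, not something CEFR-SP prescribes, and the sentence labels are made up.

```python
import math
from statistics import mean

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def aggregate(sentence_levels, policy="mean"):
    """Collapse per-sentence CEFR labels into one passage-level label."""
    if not sentence_levels:
        raise ValueError("need at least one sentence-level prediction")
    idx = sorted(LEVELS.index(level) for level in sentence_levels)
    if policy == "mean":
        # Round half up, avoiding Python's round-half-to-even behaviour.
        pick = math.floor(mean(idx) + 0.5)
    elif policy == "max":
        pick = idx[-1]
    elif policy == "p80":
        # Nearest-rank 80th percentile of the sorted sentence levels.
        pick = idx[math.ceil(0.8 * len(idx)) - 1]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return LEVELS[pick]


# Made-up sentence-level predictions for one passage.
sentences = ["A2", "A2", "B1", "B1", "B1", "B2", "C1"]
for policy in ("mean", "p80", "max"):
    print(policy, aggregate(sentences, policy))
# mean B1
# p80 B2
# max C1
```

The same seven predictions yield B1, B2, or C1 depending on the policy, so a passage-level figure derived from a sentence-level model is only interpretable alongside the aggregation rule that produced it.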

The commercial leader, Text Inspector's Scorecard / Lexical Profile©, is closed but transparent about the same limits the open systems have. The Text Inspector documentation explicitly states that the Lexical Profile is an estimate that should not be used without considering other factors, is calibrated only on writing, reading, and listening (not speaking), assesses only vocabulary and metadiscourse, and uses a limited feature set. These caveats apply to every CEFR text classifier currently in production.

Universal caveats

Five limits hold across every CEFR text classifier, free or commercial:

Adjacent-level confusion is irreducible. Most misclassifications fall on the next band up or down. Even professional human raters disagree at this granularity (inter-rater Cohen's κ ≈ 0.5–0.7 on 6-class CEFR), so 100% accuracy is not the target. Within-one-level reliability is.
Domain and genre shift the calibration. A classifier trained on academic prose under-rates conversational text; one trained on learner writing over-rates clean expert writing. No public classifier conditions on genre.
Vocabulary dominates the signal. Lexical features carry 60–80% of the predictive weight in feature-based models, which means the classifiers inherit the limits of Lexical Sophistication: they undervalue syntactic complexity, discourse organisation, and topical demand.
Population calibration matters. CEFR-J-based tools are anchored to Japanese learners of English; European-anchored CEFR resources differ on borderline items. For IELTS work the European anchor is closer to the test population.
A single classifier is uninformative; convergence is diagnostic. The validation logic from Text Metric Implementation Variance applies. When two independent classifiers (one feature-based, one transformer-based) agree on a text, the call is robust; when they disagree, the disagreement itself is the signal that the text sits between bands or that one feature family is being misled.
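The convergence check in the last item can be sketched as a small triage function. The three outcome labels, and the reading of a one-band gap as "between bands" and a larger gap as "one feature family misled", are assumptions made for illustration; the inputs stand for the outputs of any two independent classifiers.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
INDEX = {level: i for i, level in enumerate(LEVELS)}


def triage(feature_based_level, transformer_level):
    """Classify the relationship between two independent CEFR predictions."""
    gap = abs(INDEX[feature_based_level] - INDEX[transformer_level])
    if gap == 0:
        return "agree"      # both models converge: treat the call as robust
    if gap == 1:
        return "adjacent"   # plausibly a text sitting between two bands
    return "conflict"       # large gap: one model is likely being misled


print(triage("B2", "B2"))  # agree
print(triage("B2", "C1"))  # adjacent
print(triage("A2", "C1"))  # conflict
```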

Validation, not parity

The temptation when building a CEFR classifier is to chase exact-match accuracy against a reference tool. This is the same category error catalogued in Text Metric Implementation Variance: different classifiers measure the same construct through different evidence and should disagree by characteristic margins. The right validation is rank correlation across a calibrated corpus: given ten texts of known CEFR level, does the classifier rank them in roughly the right order?
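Rank correlation of this kind is commonly measured with Spearman's rho. The sketch below is a plain-Python version that uses average ranks for ties, which matters here because six bands guarantee ties on any corpus larger than six texts. The labels are made up and the function names are hypothetical.

```python
import math

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
INDEX = {level: i for i, level in enumerate(LEVELS)}


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Made-up example: two adjacent pairs are swapped, overall order is preserved.
gold = ["A1", "A2", "B1", "B2", "C1", "C2"]
predicted = ["A1", "B1", "A2", "B2", "C2", "C1"]
rho = spearman([INDEX[g] for g in gold], [INDEX[p] for p in predicted])
print(round(rho, 3))  # 0.886
```

In this example only two of six predictions match the gold label exactly, yet the rank correlation is high. That is the distinction the validation argument turns on: the classifier orders the texts sensibly even where it misses the exact band.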

For IELTS reading-passage sourcing the practical use is filtering, not labelling. Run the classifier across candidate passages, accept the within-one-level call as a first cut, and reserve the absolute level decision for a human rater who can see the things classifiers cannot: discourse organisation, topic familiarity, register fit. The classifier earns its keep by removing the obvious mismatches and concentrating human review on the borderline cases.
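The filtering step can be sketched as follows. The `predict` argument stands for any classifier that returns a CEFR label; here it is replaced by a hypothetical lookup table of made-up predictions, and the passage identifiers and function name are likewise illustrative.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
INDEX = {level: i for i, level in enumerate(LEVELS)}


def first_cut(passages, target, predict):
    """Split passages into a shortlist for human review and obvious mismatches."""
    shortlist, rejected = [], []
    for passage in passages:
        predicted = predict(passage)
        gap = abs(INDEX[predicted] - INDEX[target])
        # Within one band of the target: keep it for a human rater to decide.
        if gap <= 1:
            shortlist.append((passage, predicted))
        else:
            rejected.append((passage, predicted))
    return shortlist, rejected


# Made-up predictions standing in for a real classifier.
FAKE_PREDICTIONS = {
    "passage-1": "A2",
    "passage-2": "B2",
    "passage-3": "C1",
    "passage-4": "C2",
}

shortlist, rejected = first_cut(list(FAKE_PREDICTIONS), "C1", FAKE_PREDICTIONS.get)
print(shortlist)  # [('passage-2', 'B2'), ('passage-3', 'C1'), ('passage-4', 'C2')]
print(rejected)   # [('passage-1', 'A2')]
```

The shortlist keeps the predicted label alongside each passage so that the human rater sees the classifier's call without being bound by it.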

References

Imperial, J. M. et al. (2025). UniversalCEFR: Enabling open multilingual research on language proficiency assessment. Proceedings of EMNLP 2025.
Arase, Y., Uchida, S. & Kajiwara, T. (2022). CEFR-Based sentence difficulty annotation and assessment. Proceedings of EMNLP 2022, 6206–6219.
Uchida, S. & Negishi, M. (2018). Assigning CEFR-J levels to English texts: The CVLA approach. CEFR-J Journal, 1, 39–52.
Montgomerie, A. (2021). Predicting the CEFR Level of English Texts. Online article and open-source XGBoost implementation.
Khallaf, N. & Sharoff, S. (2022). Towards CEFR-graded text classification: A survey of methods and corpora. Proceedings of LREC 2022.
