FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing
- URL: http://arxiv.org/abs/2511.22883v1
- Date: Fri, 28 Nov 2025 05:17:45 GMT
- Title: FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing
- Authors: Jingheng Ye, Shen Wang, Jiaqi Chen, Hebin Wang, Deqing Zou, Yanyu Zhu, Jiwei Tang, Hai-Tao Zheng, Ruitong Liu, Haoyang Li, Yanfeng Wang, Qingsong Wen
- Abstract summary: We present the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students. Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed.
- Score: 68.23874413455594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback for K-12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the problem of Fine-grained Error Analysis for English Learners and presenting the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, together with a well-developed English writing error taxonomy. Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to explore their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs' ability to perform fine-grained error analysis, highlighting the need for advances in methods tailored to educational applications.
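The abstract does not publish the annotation schema, so the following is a minimal sketch of what a FEANEL-style record could look like; every field name, the severity scale, and the example values are assumptions for illustration, not the paper's actual format.

```python
from dataclasses import dataclass

# Hypothetical layout of a fine-grained error annotation. The abstract only
# states that each error carries a type, a severity, and explanatory feedback,
# so all field names and values below are illustrative, not the paper's schema.
@dataclass
class ErrorAnnotation:
    span_start: int   # character offset of the erroneous span in the essay
    span_end: int
    error_type: str   # category from the part-of-speech-based taxonomy, e.g. "verb tense"
    severity: str     # assumed ordinal scale, e.g. "minor" / "major" / "critical"
    correction: str   # expert-suggested rewrite of the span
    feedback: str     # explanatory feedback written by the annotator

@dataclass
class AnnotatedEssay:
    essay_id: str
    grade_band: str   # "elementary" or "secondary"
    text: str
    errors: list[ErrorAnnotation]

# Example with a fabricated essay fragment:
essay = AnnotatedEssay(
    essay_id="demo-001",
    grade_band="elementary",
    text="Yesterday I go to the park.",
    errors=[ErrorAnnotation(
        span_start=12, span_end=14,
        error_type="verb tense", severity="minor",
        correction="went",
        feedback="Use the past tense 'went' for a completed action that happened yesterday.",
    )],
)
```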
Related papers
- A Taxonomy of Errors in English as she is spoke: Toward an AI-Based Method of Error Analysis for EFL Writing Instruction [0.0]
This study describes the development of an AI-assisted error analysis system designed to identify, categorize, and correct writing errors in English. The system employs a detailed taxonomy grounded in linguistic theories from Corder (1967), Richards (1971), and James (1998). The AI successfully identified diverse error types but showed limitations in contextual understanding and occasionally generated new error categories when encountering uncoded errors.
arXiv Detail & Related papers (2025-11-29T08:45:00Z) - Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities. Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z) - Testing LLMs' Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT [3.9325957466009203]
This study investigates the capabilities of large language models (LLMs) in annotating MT outputs based on an error typology. We compare ChatGPT annotations with human expert evaluations of translations produced by DeepL and ChatGPT itself.
arXiv Detail & Related papers (2025-04-21T12:21:37Z) - CPG-EVAL: A Multi-Tiered Benchmark for Evaluating the Chinese Pedagogical Grammar Competence of Large Language Models [6.0020878662404975]
This paper introduces the first benchmark specifically designed to evaluate LLMs' knowledge of pedagogical grammar within the context of foreign language instruction. The benchmark comprises five tasks designed to assess grammar recognition, fine-grained grammatical distinction, categorical discrimination, and resistance to linguistic interference.
arXiv Detail & Related papers (2025-04-17T18:01:50Z) - Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Model (LLM) reasoning tasks. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for African American English (AAE) inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z) - JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models (LLMs). JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures. Our experimental results reveal that state-of-the-art (SOTA) reasoning LLMs perform on par with or better than the human average but significantly worse than the human ceiling.
arXiv Detail & Related papers (2025-01-24T15:49:10Z) - When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages [9.138590152838754]
Segment-level quality estimation (QE) is a challenging cross-lingual language understanding task. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios. Our results indicate that prompt-based approaches are outperformed by encoder-based fine-tuned QE models.
arXiv Detail & Related papers (2025-01-08T12:54:05Z) - LLM-as-a-Judge & Reward Model: What They Can and Cannot Do [2.2469442203227863]
We conduct a comprehensive analysis of automated evaluators, reporting several key findings on their behavior.
We discover that English evaluation capabilities significantly influence language-specific evaluation capabilities, enabling evaluators trained in English to easily transfer their skills to other languages.
We find that state-of-the-art evaluators struggle with challenging prompts, in either English or Korean, underscoring their limitations in assessing or generating complex reasoning questions.
arXiv Detail & Related papers (2024-09-17T14:40:02Z) - How Does Quantization Affect Multilingual LLMs? [50.867324914368524]
Quantization techniques are widely used to improve inference speed and deployment of large language models.
We conduct a thorough analysis of quantized multilingual LLMs, focusing on performance across languages and at varying scales.
arXiv Detail & Related papers (2024-07-03T15:39:40Z) - LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z) - Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z) - xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between machine translation evaluation approaches.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
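Since xCOMET is released as an open-source metric, a brief usage sketch may help. The snippet below assumes the unbabel-comet Python package and the Unbabel/XCOMET-XL checkpoint; the checkpoint name and output fields are taken from the public release and may differ across versions.

```python
# Sketch of scoring one translation with xCOMET via the unbabel-comet package
# (pip install unbabel-comet). Checkpoint name and output fields are assumptions
# based on the public release and may vary between versions.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XL")  # fetches the checkpoint from the Hugging Face Hub
model = load_from_checkpoint(model_path)

data = [{
    "src": "Elle a acheté deux billets.",
    "mt": "She bought two ticket.",
    "ref": "She bought two tickets.",
}]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)        # segment-level quality scores
print(output.system_score)  # corpus-level average
```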
arXiv Detail & Related papers (2023-10-16T15:03:14Z) - Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
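To make the MQM-style setup concrete, here is a minimal sketch of an error-analysis prompt in the spirit of EAPrompt; the wording, categories, and severity labels are assumptions for illustration, not the prompt used in the paper.

```python
# Minimal sketch of an MQM-style error-analysis prompt in the spirit of EAPrompt.
# The instructions, categories, and severity labels below are illustrative only.
MQM_PROMPT_TEMPLATE = """You are an expert translation evaluator.

Source: {source}
Translation: {translation}

List every error in the translation. For each error give:
- the erroneous span,
- an MQM category (e.g. accuracy, fluency, terminology, style),
- a severity (minor or major),
- a one-sentence explanation.
Finally, report the total number of major and minor errors."""

def build_eaprompt(source: str, translation: str) -> str:
    """Fill the template for one source/translation pair."""
    return MQM_PROMPT_TEMPLATE.format(source=source, translation=translation)

if __name__ == "__main__":
    print(build_eaprompt("Bonjour tout le monde.", "Hello everyone the world."))
```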