JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction
- URL: http://arxiv.org/abs/2511.21700v1
- Date: Sun, 16 Nov 2025 05:58:39 GMT
- Title: JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction
- Authors: Yuhao Zhan, Yuqing Zhang, Jing Yuan, Qixiang Ma, Zhiqi Yang, Yu Gu, Zemin Liu, Fei Wu
- Abstract summary: We introduce the Judge of Edit-Level Validity (JELV) to validate correction edits in terms of grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity dataset (PEVData) as a benchmark, JELV offers two implementations. We apply JELV to filter LLM-generated correction candidates, expanding BEA19's single-reference dataset of 38,692 source sentences.
- Score: 22.662896396339107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation scores and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework that validates correction edits in terms of grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as a benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false positive decoupling and fluency scoring, resulting in state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding BEA19's single-reference dataset of 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.
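To make the classifier-based implementation more concrete, below is a minimal sketch, assuming a DeBERTa-style sequence-pair classifier that scores a source sentence against a candidate correction and keeps only candidates judged valid. The checkpoint name, label convention, threshold, and helper functions are illustrative assumptions rather than the authors' released code; a real JELV judge would operate at the edit level and be fine-tuned on PEVData.

```python
# Illustrative sketch only -- not the authors' released code. The checkpoint
# name, label convention (label 1 == "valid"), and acceptance threshold are
# assumptions made for demonstration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-v3-base"  # placeholder base model; a real judge would be fine-tuned on PEVData
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def edit_is_valid(source: str, corrected: str, threshold: float = 0.85) -> bool:
    """Score a (source, corrected-sentence) pair and accept the correction
    when the probability of the assumed 'valid' class clears the threshold."""
    inputs = tokenizer(source, corrected, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    p_valid = torch.softmax(logits, dim=-1)[0, 1].item()
    return p_valid >= threshold

def expand_references(source: str, candidates: list[str]) -> list[str]:
    """Keep only LLM-generated correction candidates judged valid,
    mirroring the reference-expansion step sketched in the abstract."""
    return [c for c in candidates if edit_is_valid(source, c)]

# Example: filter two candidate corrections for one source sentence.
print(expand_references(
    "She go to school every days.",
    ["She goes to school every day.", "She go to school every day."],
))
```

In this reading, the candidates that pass the judge are appended as additional references for the source sentence, which is how a single-reference corpus such as BEA19 could be expanded before retraining.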
Related papers
- VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment [24.492954219955788]
We propose a closed-loop framework designed to navigate the trade-off between fine-tuning and aligning Large Language Models (LLMs). VISA features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities.
arXiv Detail & Related papers (2026-03-05T05:12:26Z)
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam [63.84155758655084]
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models. We introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7-10 percentage points.
arXiv Detail & Related papers (2026-02-15T02:50:15Z)
- Context-Adaptive Requirements Defect Prediction through Human-LLM Collaboration [1.4499356176178066]
We propose a Human-LLM Collaboration (HLC) approach that treats defect prediction as an adaptive process rather than a static classification task. We evaluate this approach using the weak word smell on the QuRE benchmark of 1,266 annotated Mercedes-Benz requirements.
arXiv Detail & Related papers (2026-01-05T10:00:14Z)
- EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI [36.91800117379075]
EVADE is a framework for generating and validating explanations to detect errors using large language models. Human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation.
arXiv Detail & Related papers (2025-11-12T03:49:05Z)
- Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes [2.9195489041890297]
Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review. Without validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data.
arXiv Detail & Related papers (2025-10-31T07:27:54Z)
- Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z)
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
- MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment. We introduce a universal and training-free framework, MQM-APE, based on the idea of filtering out non-impactful errors. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM.
arXiv Detail & Related papers (2024-09-22T06:43:40Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z)