Related papers: DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models

DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models

URL: http://arxiv.org/abs/2412.12832v1
Date: Tue, 17 Dec 2024 11:54:16 GMT
Title: DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models
Authors: Jinxiang Xie, Yilin Li, Xunjian Yin, Xiaojun Wan,
Abstract summary: Large language model (LLM)-based Grammatical Error Correction (GEC) models often produce corrections that diverge from provided gold references.<n>This discrepancy undermines the reliability of traditional reference-based evaluation metrics.<n>We propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism.
Score: 39.493913608472404
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.

Related papers

Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation [4.45354703148321]
Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance.<n>We present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation.
arXiv Detail & Related papers (2025-11-28T12:40:30Z)
A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs [0.0]
We first present a new, publicly available dataset of Turkish citation intents, created with a purpose-built annotation tool.<n>We then evaluate the performance of standard In-Context Learning with Large Language Models (LLMs), demonstrating that its effectiveness is limited by inconsistent results caused by manually designed prompts.<n>For final classification, we employ a stacked generalization ensemble to aggregate outputs from multiple optimized models, ensuring stable and reliable predictions.
arXiv Detail & Related papers (2025-09-26T05:44:04Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs. We show that GED outperforms baseline methods in model ranking, response selection, and model alignment tasks.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
Assessing Generative Language Models in Classification Tasks: Performance and Self-Evaluation Capabilities in the Environmental and Climate Change Domain [0.0]
This paper examines the performance of two Large Language Models (LLMs), GPT3.5 and Llama2 and one Small Language Model (SLM) Gemma, across three different classification tasks within the climate change (CC) and environmental domain.
arXiv Detail & Related papers (2024-08-30T15:52:41Z)
Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension [12.09297288867446]
We examine a framework for evaluating optimised models in training-set free setting on synthetically generated challenge sets. We find that despite the simplicity of the generation method, the data can compete with crowd-sourced datasets with regard to naturalness and lexical diversity. We conduct further experiments and show that state-of-the-art language model-based MRC systems can learn to succeed on the challenge set correctly.
arXiv Detail & Related papers (2024-08-09T12:23:36Z)
Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment. To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation. It consists of corrections with human ratings along two different granularities. The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
An Examination of the Compositionality of Large Generative Vision-Language Models [7.639748270719836]
Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs.
arXiv Detail & Related papers (2023-08-21T06:50:29Z)
Learning Evaluation Models from Large Language Models for Sequence Generation [44.22820310679188]
Large language models achieve state-of-the-art performance on sequence generation evaluation, but typically have a large number of parameters. We propose textbfECT, an textbfevaluation textbfcapability textbftransfer method, to transfer the evaluation capability from LLMs to relatively lightweight language models. Based on the proposed ECT, we learn various evaluation models from ChatGPT, and employ them as reward models to improve sequence generation models.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model. We find that fine-tuning BERT tends to over-fit the error model while under-fit the language model, resulting in poor generalization to out-of-distribution error patterns. We demonstrate that a very simple strategy, randomly masking 20% non-error tokens from the input sequence during fine-tuning is sufficient for learning a much better language model without sacrificing the error model.
arXiv Detail & Related papers (2023-05-28T13:19:12Z)
Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
slice detection models (SDM) automatically identify underperforming groups of datapoints. This paper proposes a benchmark named "Discover, Explain, improve (DEIM)" for classification NLP tasks. Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction [36.74272211767197]
We propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. We present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios.
arXiv Detail & Related papers (2022-10-19T10:20:39Z)
A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets. There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected. We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.