gec-metrics: A Unified Library for Grammatical Error Correction Evaluation
- URL: http://arxiv.org/abs/2505.19388v1
- Date: Mon, 26 May 2025 01:10:16 GMT
- Title: gec-metrics: A Unified Library for Grammatical Error Correction Evaluation
- Authors: Takumi Goto, Yusuke Sakai, Taro Watanabe
- Abstract summary: gec-metrics is a library for using and developing grammatical error correction (GEC) evaluation metrics. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Our code is released under the MIT license and is also distributed as an installable package.
- Score: 13.02513034520894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to the development of GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.
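As a rough sketch of what such a unified interface can look like, the snippet below defines an abstract metric class and a toy token-overlap metric. The class and method names are hypothetical and do not reproduce the actual gec-metrics API; they only illustrate the "one interface, many metrics" design described in the abstract.

```python
from abc import ABC, abstractmethod


class GECMetricSketch(ABC):
    """Hypothetical unified interface for GEC metrics (illustration only,
    not the actual gec-metrics API)."""

    @abstractmethod
    def score_corpus(self, sources: list[str], hypotheses: list[str],
                     references: list[list[str]]) -> float:
        """Return a single corpus-level score for one system."""


class TokenOverlapF1(GECMetricSketch):
    """Toy metric: F1 of token overlap between a hypothesis and its closest reference."""

    def score_corpus(self, sources, hypotheses, references):
        scores = [max(self._f1(hyp.split(), ref.split()) for ref in refs)
                  for hyp, refs in zip(hypotheses, references)]
        return sum(scores) / len(scores)

    @staticmethod
    def _f1(hyp_tokens, ref_tokens):
        common = len(set(hyp_tokens) & set(ref_tokens))
        if common == 0:
            return 0.0
        p, r = common / len(hyp_tokens), common / len(ref_tokens)
        return 2 * p * r / (p + r)


metric = TokenOverlapF1()
print(metric.score_corpus(["He go to school ."],
                          ["He goes to school ."],
                          [["He goes to school ."]]))
```

Under an interface of this kind, swapping one metric for another only changes which class is instantiated, which is what makes consistent, fair system comparisons straightforward.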
Related papers
- AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning [2.325084918639609]
We introduce AllMetrics, an open-source unified Python library designed to standardize metric evaluation across diverse machine learning tasks. The library implements class-specific reporting for multi-class tasks through parameters to cover all use cases. It was evaluated on datasets from domains such as healthcare, finance, and real estate, and its outputs were compared with existing Python, MATLAB, and R implementations.
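To make "class-specific reporting for multi-class tasks" concrete, the sketch below uses scikit-learn's per-class precision/recall/F1 reporting. It is not the AllMetrics API, only an illustration of per-class versus aggregate reporting, and the labels and predictions are made up.

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

# average=None returns one value per class instead of a single aggregate.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=[0, 1, 2]
)
for cls, p, r, f, s in zip([0, 1, 2], precision, recall, f1, support):
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```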
arXiv Detail & Related papers (2025-05-21T18:36:05Z)
- Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human? [13.02513034520894]
We propose an aggregation method for automatic evaluation metrics which aligns with human evaluation methods to bridge the gap. We conducted experiments using various metrics, including edit-based metrics, $n$-gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for most of the metrics on the SEEDA benchmark.
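The gap in question is roughly that human evaluations usually rank systems sentence by sentence and then aggregate the rankings, whereas metrics are often applied once at the corpus level. The toy sketch below contrasts the two aggregation styles; the scores are made up and this is not necessarily the aggregation method proposed in the paper.

```python
import numpy as np

# sentence_scores[system][i] = metric score of that system's output for sentence i (toy data)
sentence_scores = {
    "sys_A": np.array([0.9, 0.2, 0.8]),
    "sys_B": np.array([0.7, 0.6, 0.7]),
}
systems = list(sentence_scores)

# Corpus-level view: average the scores, then rank the systems once.
corpus_ranking = sorted(systems, key=lambda s: -sentence_scores[s].mean())

# Human-style view: rank the systems on every sentence, then aggregate the ranks.
per_sentence_ranks = {s: [] for s in systems}
for i in range(3):
    ordered = sorted(systems, key=lambda s: -sentence_scores[s][i])
    for rank, s in enumerate(ordered, start=1):
        per_sentence_ranks[s].append(rank)
rank_based = sorted(systems, key=lambda s: np.mean(per_sentence_ranks[s]))

print("corpus-level ranking:  ", corpus_ranking)  # the two views can disagree,
print("rank-aggregated ranking:", rank_based)     # which is the gap being bridged
```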
arXiv Detail & Related papers (2025-02-13T15:39:07Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. We collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
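The meta-evaluation idea can be sketched as follows: compute an offline score (here, a hand-rolled character-level edit distance) for each generated/edited message pair and correlate it with the online edit-count signal. The data below is made up and the code only illustrates the setup, not the paper's pipeline.

```python
from scipy.stats import spearmanr


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


generated = ["fix typo in readme", "add login endpoint", "update deps"]
human_edited = ["Fix typo in README", "Add login API endpoint", "Bump dependency versions"]
online_edit_counts = [2, 1, 4]  # hypothetical "edits before commit" signal

offline_scores = [levenshtein(g, h) for g, h in zip(generated, human_edited)]
rho, p_value = spearmanr(offline_scores, online_edit_counts)
print(f"Spearman correlation with the online signal: {rho:.2f} (p={p_value:.2f})")
```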
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
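The stress-testing recipe can be illustrated with a toy metric: apply synthetic errors to a reference and check whether the score drops commensurately. The token-overlap F1 below is only a stand-in for the model-based metrics actually studied, and the example texts are made up.

```python
def overlap_f1(candidate: str, reference: str) -> float:
    """Toy bag-of-words F1 between a candidate and a reference."""
    c, r = set(candidate.split()), set(reference.split())
    common = len(c & r)
    if common == 0:
        return 0.0
    p, rec = common / len(c), common / len(r)
    return 2 * p * rec / (p + rec)


reference = "the committee approved the new budget after a long debate"
perturbations = {
    "negation": "the committee rejected the new budget after a long debate",
    "truncation": "the committee approved the new",
    "shuffled words": "budget the long a approved new after debate committee the",
}

base = overlap_f1(reference, reference)
for name, text in perturbations.items():
    drop = base - overlap_f1(text, reference)
    print(f"{name:>14}: score drop = {drop:.2f}")
```

Note how this stand-in metric is completely blind to word shuffling and nearly blind to negation; insensitivities of exactly this kind are what such synthetic stress tests are designed to expose in real metrics.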
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
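A minimal sketch of the sentence-as-unit idea follows, using token-overlap F1 as a stand-in for SMART's actual sentence matching function; the example texts are made up and this does not reproduce the metric itself.

```python
def soft_match(a: str, b: str) -> float:
    """Toy sentence similarity: token-overlap F1 (stand-in for a learned matcher)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    common = len(sa & sb)
    if common == 0:
        return 0.0
    p, r = common / len(sa), common / len(sb)
    return 2 * p * r / (p + r)


def sentence_level_score(candidate_sents, reference_sents):
    # Precision-like direction: each candidate sentence takes its best reference match.
    return sum(max(soft_match(c, r) for r in reference_sents)
               for c in candidate_sents) / len(candidate_sents)


candidate = ["The storm hit the coast overnight.", "Thousands lost power."]
reference = ["A storm struck the coast during the night.",
             "Power was lost by thousands of residents."]
print(f"sentence-level soft-match score: {sentence_level_score(candidate, reference):.2f}")
```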
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- MISeval: a Metric Library for Medical Image Segmentation Evaluation [1.4680035572775534]
There is no universal metric library in Python for standardized and reproducible evaluation.
We propose MISeval, an open-source and publicly available Python package: a metric library for medical image segmentation evaluation.
arXiv Detail & Related papers (2022-01-23T23:06:47Z)
- SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics [74.28810048824519]
SacreROUGE is an open-source library for using and developing summarization evaluation metrics.
The library provides Python wrappers around the official implementations of existing evaluation metrics.
It provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments.
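As a generic illustration of that meta-evaluation step, correlating system-level metric scores with human judgments can look like the sketch below. It uses plain scipy rather than SacreROUGE's own API, and the scores are made up.

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical per-system scores (one number per summarization system).
metric_scores = [0.42, 0.35, 0.51, 0.29, 0.47]
human_scores  = [3.8,  3.1,  4.2,  2.9,  3.9]

pearson_r, _ = pearsonr(metric_scores, human_scores)
tau, _ = kendalltau(metric_scores, human_scores)
print(f"Pearson r = {pearson_r:.2f}, Kendall tau = {tau:.2f}")
```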
arXiv Detail & Related papers (2020-07-10T13:26:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.