Related papers: A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction

A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction

URL: http://arxiv.org/abs/2510.06749v1
Date: Wed, 08 Oct 2025 08:15:44 GMT
Title: A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction
Authors: Eitan Klinger, Zihao Huang, Tran Minh Nguyen, Emma Jayeon Park, Yige Chen, Yang Gu, Qingyu Gao, Siliang Liu, Mengyang Qiu, Jungyeul Park,
Abstract summary: Existing frameworks, largely edit-based and English-centric, rely on rigid alignments between system and reference edits.<n>This paper introduces a formal framework for textitfluency-based multi-reference evaluation, framing $n$-gram similarity as an aggregation problem over multiple legitimate corrections.
Score: 9.020566998995696
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating grammatical error correction requires metrics that reflect the diversity of valid human corrections rather than privileging a single reference. Existing frameworks, largely edit-based and English-centric, rely on rigid alignments between system and reference edits, limiting their applicability in multilingual and generative settings. This paper introduces a formal framework for \textit{fluency-based multi-reference evaluation}, framing $n$-gram similarity as an aggregation problem over multiple legitimate corrections. Within this formulation, we instantiate GLEU through four aggregation strategies--\textsc{select-best}, \textsc{simple-average}, \textsc{weighted-average}, and \textsc{merged-counts}--and analyze their properties of boundedness, monotonicity, and sensitivity to reference variation. Empirical results on Czech, Estonian, Ukrainian, and Chinese corpora show that these strategies capture complementary aspects of fluency and coverage. The framework unifies multi-reference evaluation into a principled, fluency-oriented approach that incorporates linguistic diversity without penalizing legitimate variation.

Related papers

A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments [2.4384521157164345]
This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists.<n>The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity.<n>Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research.
arXiv Detail & Related papers (2026-02-12T10:19:50Z)
BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models [3.643198597030366]
This paper introduces BiasLab, an open-source, model-agnostic evaluation framework for quantifying output-level (extrinsic) bias.<n>The framework supports evaluation across diverse bias axes, including demographic, cultural, political, and geopolitical topics.
arXiv Detail & Related papers (2026-01-11T11:07:46Z)
SCORE: A Semantic Evaluation Framework for Generative Document Parsing [2.5101597298392098]
Multi-modal generative document parsing systems produce semantically correct yet structurally divergent outputs.<n>Conventional metrics-CER, WER, IoU, or TEDS-misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior.<n>We introduce SCORE, an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks.
arXiv Detail & Related papers (2025-09-16T16:06:19Z)
Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. We propose a single framework built on their equivalence in learning scenarios. Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
Cross-domain Chinese Sentence Pattern Parsing [67.1381983012038]
Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching. Existing SPSs rely heavily on textbook corpora for training, lacking cross-domain capability. This paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework.
arXiv Detail & Related papers (2024-02-26T05:30:48Z)
CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting. We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z)
Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation [70.58243648754507]
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR) Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously. Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic similarity and transfer tasks.
arXiv Detail & Related papers (2022-10-18T11:37:36Z)
On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments. We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
Unsupervised Word Translation Pairing using Refinement based Point Set Registration [8.568050813210823]
Cross-lingual alignment of word embeddings play an important role in knowledge transfer across languages. Current unsupervised approaches rely on similarities in geometric structure of word embedding spaces across languages. This paper proposes BioSpere, a novel framework for unsupervised mapping of bi-lingual word embeddings onto a shared vector space.
arXiv Detail & Related papers (2020-11-26T09:51:29Z)
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity. This paper concerns ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
Multilingual Alignment of Contextual Word Representations [49.42244463346612]
BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model. We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer. These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.