Excision Score: Evaluating Edits with Surgical Precision
- URL: http://arxiv.org/abs/2510.21537v1
- Date: Fri, 24 Oct 2025 15:01:44 GMT
- Title: Excision Score: Evaluating Edits with Surgical Precision
- Authors: Nikolai Gruzinov, Ksenia Sycheva, Earl T. Barr, Alex Bezzubov
- Abstract summary: We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems. We show that popular pairwise measures, like BLEU, fail to meet our adequacy criteria because their scores are dominated by the shared content. We propose a novel static measure, Excision Score (ES), which uses the longest common subsequence (LCS) to excise shared content before comparing the divergent regions.
- Score: 2.352496216126117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share a majority of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria because their scores are dominated by the shared content. They report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw we address. We propose a novel static measure, Excision Score (ES), which uses the longest common subsequence (LCS) to remove the content that the existing document shares with the ground-truth and predicted revisions, before comparing only the remaining divergent regions. This is analogous to a surgeon creating a sterile field to focus on the work area. We use approximation to reduce the standard cubic LCS computation to quadratic time. In code-editing evaluation, where static measures are often used as a cheap proxy for passing tests, we demonstrate that ES surpasses existing measures. When aligned with test execution on HumanEvalFix, ES improves over its nearest competitor, SARI, by 12% Pearson correlation and by >21% over standard measures like BLEU. The key criterion is invariance to shared context; when we perturb HumanEvalFix with increased shared context, ES's improvement over SARI increases to 20% and to >30% over standard measures. ES also handles other corner cases that other measures do not, such as correctly aligning moved code blocks, and appropriately rewarding matching insertions or deletions.
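The abstract describes the mechanism concretely enough to sketch. The following Python snippet is a minimal illustration of the excision idea, not the authors' implementation: it uses difflib.SequenceMatcher as a stand-in for the paper's quadratic-time LCS alignment (SequenceMatcher computes a related but not identical matching), and a simple alignment ratio over the excised edit regions as the final comparison. Both choices are assumptions; the paper's actual alignment and scoring details may differ.

```python
from difflib import SequenceMatcher

def edit_region(original, revision):
    """Return the material that `revision` changes relative to `original`:
    inserted tokens verbatim, deleted tokens as explicit <del> markers,
    so that matching deletions can be rewarded too."""
    sm = SequenceMatcher(a=original, b=revision, autojunk=False)
    region = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "insert"):
            region.extend(revision[j1:j2])        # tokens the revision adds
        if tag in ("replace", "delete"):
            region.extend(("<del>", tok) for tok in original[i1:i2])
    return region

def excision_score(original, reference, prediction):
    """Score only the regions where the two revisions diverge from the
    original document, so shared context cannot inflate the score."""
    ref_edits = edit_region(original, reference)
    pred_edits = edit_region(original, prediction)
    if not ref_edits and not pred_edits:
        return 1.0  # neither revision changed anything
    return SequenceMatcher(a=ref_edits, b=pred_edits, autojunk=False).ratio()

# Toy example: a one-token fix inside a large shared context.
original   = "def f ( x ) : return x - 1".split()
reference  = "def f ( x ) : return x + 1".split()
prediction = "def f ( x ) : return x * 1".split()
print(excision_score(original, reference, prediction))  # 0.5
```

In the toy example the two revisions share the entire surrounding context, so a whole-document measure in the BLEU family would rate them as nearly identical; comparing only the excised edit regions yields 0.5, since both revisions delete the same token but insert different ones.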
Related papers
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam [63.84155758655084]
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models. We introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and a fine-grained error taxonomy. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7-10 percentage points.
arXiv Detail & Related papers (2026-02-15T02:50:15Z) - Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation [34.071151696990384]
This study focuses on edits specifically designed for grammatical error correction (GEC). We propose the edit vector, a representation for an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with the SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain.
arXiv Detail & Related papers (2026-02-05T08:05:42Z) - The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure [98.71456610527598]
Embedding-based similarity metrics can be influenced by spurious attributes like the text's source or language. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost.
arXiv Detail & Related papers (2025-07-01T23:17:12Z) - Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z) - Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings along two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z) - SMATCH++: Standardized and Extended Evaluation of Semantic Graphs [4.987581730476023]
The Smatch metric is a popular method for evaluating graph distances.
We show how to fully conform to annotation guidelines that allow structurally deviating but valid graphs.
For improved scoring, we propose standardized and extended metric calculation of fine-grained sub-graph meaning aspects.
arXiv Detail & Related papers (2023-05-11T17:29:47Z) - End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that might appear.
We propose a two-fold evaluation, where transcription accuracy and reading-order (RO) goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z) - Integrating Rankings into Quantized Scores in Peer Review [61.27794774537103]
In peer review, reviewers are usually asked to provide scores for the papers.
To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed.
There is no standard procedure for using this ranking information, and Area Chairs may use it in different ways.
We take a principled approach to integrate the ranking information into the scores.
arXiv Detail & Related papers (2022-04-05T19:39:13Z) - Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)