Diff-XYZ: A Benchmark for Evaluating Diff Understanding
- URL: http://arxiv.org/abs/2510.12487v1
- Date: Tue, 14 Oct 2025 13:23:01 GMT
- Title: Diff-XYZ: A Benchmark for Evaluating Diff Understanding
- Authors: Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov,
- Abstract summary: We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks. Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT.
- Score: 38.94055952813874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to do a focused empirical study of the unified diff format and run a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format works well for larger models in the diff generation scenario, yet is not well suited for diff analysis or for smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and models that edit code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
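The three task directions can be sketched with a toy example. The snippet below uses Python's standard `difflib` to produce a unified diff from an (old code, new code) pair (the diff generation direction), and a deliberately naive hunk applier for the apply direction. The `apply_unified_diff` helper is an illustration written for this note, not part of the benchmark's tooling; it assumes a single-file diff with hunks in order.

```python
import difflib

# A toy <old code, new code> pair, mimicking one benchmark instance.
old_code = "def add(a, b):\n    return a + b\n"
new_code = 'def add(a, b):\n    """Sum two numbers."""\n    return a + b\n'

# Diff generation: new code - old code -> diff (unified format).
diff = "".join(difflib.unified_diff(
    old_code.splitlines(keepends=True),
    new_code.splitlines(keepends=True),
    fromfile="old", tofile="new",
))

def apply_unified_diff(source: str, diff_text: str) -> str:
    """Naively apply a unified diff: copy untouched lines, drop '-' lines,
    insert '+' lines. Illustration only; real patch tools also verify context."""
    src = source.splitlines(keepends=True)
    out, idx = [], 0
    for line in diff_text.splitlines(keepends=True):
        if line.startswith(("---", "+++")):
            continue  # file headers
        if line.startswith("@@"):
            # Hunk header "@@ -l,s +l,s @@": jump to old-file start line l.
            old_start = int(line.split()[1].split(",")[0][1:])
            out.extend(src[idx:old_start - 1])
            idx = old_start - 1
        elif line.startswith("-"):
            idx += 1                         # line removed from old code
        elif line.startswith("+"):
            out.append(line[1:])             # line added in new code
        elif line.startswith(" "):
            out.append(src[idx]); idx += 1   # unchanged context line
    out.extend(src[idx:])
    return "".join(out)

# Apply: old code + diff -> new code.
reconstructed = apply_unified_diff(old_code, diff)
```

Anti-apply is the mirror image: given the new code and the same diff, recover the old code (conceptually, apply the diff with `+` and `-` roles swapped).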
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls [83.89771461061903]
Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs). We identify two key challenges contributing to search inefficiency: \textit{over-exploration} due to redundant states with semantically equivalent content, and \textit{under-exploration} caused by high variance in verifier scoring. We propose FETCH, a flexible, plug-and-play system compatible with various tree search algorithms.
arXiv Detail & Related papers (2025-02-16T16:12:01Z) - Reasoning to Attend: Try to Understand How <SEG> Token Works [44.33848900059659]
We show that the $\texttt{<SEG>}$ token contributes to semantic similarity within image-text pairs. We present READ, which facilitates LMMs' resilient \textbf{REA}soning capability of where to atten\textbf{D} under the guidance of highly activated points.
arXiv Detail & Related papers (2024-12-23T17:44:05Z) - Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance [1.313675711285772]
We propose an interactive approach to optimize source code differences (diffs)
Users can provide feedback on parts of a diff that are matched but should not be, or parts that should be matched but are not.
Results on 23 GitHub projects confirm that 92% of non-optimal diffs can be addressed with fewer than four feedback actions in the ideal case.
arXiv Detail & Related papers (2024-09-20T15:43:55Z) - Describing Differences in Image Sets with Natural Language [101.80939666230168]
Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets.
We introduce VisDiff, which first captions the images and prompts a language model to propose difference descriptions.
We are able to find interesting and previously unknown differences in datasets and models, demonstrating VisDiff's utility in revealing nuanced insights.
arXiv Detail & Related papers (2023-12-05T18:59:16Z) - Boosting Commit Classification with Contrastive Learning [0.8655526882770742]
Commit Classification (CC) is an important task in software maintenance.
We propose a contrastive learning-based commit classification framework.
Our framework can solve the CC problem simply but effectively in fewshot scenarios.
arXiv Detail & Related papers (2023-08-16T10:02:36Z) - Augmenting Diffs With Runtime Information [53.22981451758425]
Collector-Sahab is a tool that augments code diffs with runtime difference information.
We run Collector-Sahab on 584 code diffs for Defects4J bugs and find it successfully augments the code diff for 95% (555/584) of them.
arXiv Detail & Related papers (2022-12-20T16:33:51Z) - Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations [44.99833362998488]
Changes in the data distribution at test time can have deleterious effects on the performance of predictive models.
We propose a test-time label shift correction that adapts to changes in the joint distribution $p(y, z)$ using EM applied to unlabeled samples.
arXiv Detail & Related papers (2022-11-28T18:52:33Z) - LMdiff: A Visual Diff Tool to Compare Language Models [25.229215469012637]
LMdiff is a tool that visually compares probability distributions of two models that differ.
We showcase the applicability of LMdiff for hypothesis generation across multiple case studies.
arXiv Detail & Related papers (2021-11-02T13:17:20Z) - Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information [67.25713071340518]
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans. We frame dataset difficulty as the lack of $\mathcal{V}$-\textit{usable information}. We also introduce \textit{pointwise $\mathcal{V}$-information} (PVI) for measuring the difficulty of individual instances.
arXiv Detail & Related papers (2021-10-16T00:21:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.