Diff-XYZ: A Benchmark for Evaluating Diff Understanding
- URL: http://arxiv.org/abs/2510.12487v1
- Date: Tue, 14 Oct 2025 13:23:01 GMT
- Title: Diff-XYZ: A Benchmark for Evaluating Diff Understanding
- Authors: Evgeniy Glukhov, Michele Conti, Egor Bogomolov, Yaroslav Golubev, Alexander Bezzubov,
- Abstract summary: We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks. Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT.
- Score: 38.94055952813874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to do a focused empirical study of the unified diff format and run a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format works well for larger models in the diff generation scenario, yet is not well suited for diff analysis or for smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and models that edit code. The dataset is published on HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
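The three task directions can be sketched with a toy example. The snippet below uses Python's standard `difflib` to produce a unified diff from an (old code, new code) pair (the diff generation direction), and a deliberately naive hunk applier for the apply direction. The `apply_unified_diff` helper is an illustration written for this note, not part of the benchmark's tooling; it assumes a single-file diff with hunks in order.

```python
import difflib

# A toy <old code, new code> pair, mimicking one benchmark instance.
old_code = "def add(a, b):\n    return a + b\n"
new_code = 'def add(a, b):\n    """Sum two numbers."""\n    return a + b\n'

# Diff generation: new code - old code -> diff (unified format).
diff = "".join(difflib.unified_diff(
    old_code.splitlines(keepends=True),
    new_code.splitlines(keepends=True),
    fromfile="old", tofile="new",
))

def apply_unified_diff(source: str, diff_text: str) -> str:
    """Naively apply a unified diff: copy untouched lines, drop '-' lines,
    insert '+' lines. Illustration only; real patch tools also verify context."""
    src = source.splitlines(keepends=True)
    out, idx = [], 0
    for line in diff_text.splitlines(keepends=True):
        if line.startswith(("---", "+++")):
            continue  # file headers
        if line.startswith("@@"):
            # Hunk header "@@ -l,s +l,s @@": jump to old-file start line l.
            old_start = int(line.split()[1].split(",")[0][1:])
            out.extend(src[idx:old_start - 1])
            idx = old_start - 1
        elif line.startswith("-"):
            idx += 1                         # line removed from old code
        elif line.startswith("+"):
            out.append(line[1:])             # line added in new code
        elif line.startswith(" "):
            out.append(src[idx]); idx += 1   # unchanged context line
    out.extend(src[idx:])
    return "".join(out)

# Apply: old code + diff -> new code.
reconstructed = apply_unified_diff(old_code, diff)
```

Anti-apply is the mirror image: given the new code and the same diff, recover the old code (conceptually, apply the diff with `+` and `-` roles swapped).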
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls [83.89771461061903]
Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs). We identify two key challenges contributing to search inefficiency: \textit{over-exploration} due to redundant states with semantically equivalent content, and \textit{under-exploration} caused by high variance in verifier scoring. We propose FETCH, a flexible, plug-and-play system compatible with various tree search algorithms.
arXiv Detail & Related papers (2025-02-16T16:12:01Z) - Reasoning to Attend: Try to Understand How <SEG> Token Works [44.33848900059659]
We show that the $\texttt{<SEG>}$ token contributes to semantic similarity within image-text pairs. We present READ, which facilitates LMMs' resilient \textbf{REA}soning capability of where to atten\textbf{D} under the guidance of highly activated points.
arXiv Detail & Related papers (2024-12-23T17:44:05Z) - Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance [1.313675711285772]
We propose an interactive approach to optimize source code differences (diffs)
Users can provide feedback on parts of a diff that are matched but should not be, or parts that should be matched but are not.
Results on 23 GitHub projects confirm that 92% of non-optimal diffs can be addressed with fewer than four feedback actions in the ideal case.
arXiv Detail & Related papers (2024-09-20T15:43:55Z) - Describing Differences in Image Sets with Natural Language [101.80939666230168]
Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets.
We introduce VisDiff, which first captions the images and prompts a language model to propose difference descriptions.
We are able to find interesting and previously unknown differences in datasets and models, demonstrating VisDiff's utility in revealing nuanced insights.
arXiv Detail & Related papers (2023-12-05T18:59:16Z) - Boosting Commit Classification with Contrastive Learning [0.8655526882770742]
Commit Classification (CC) is an important task in software maintenance.
We propose a contrastive learning-based commit classification framework.
Our framework can solve the CC problem simply but effectively in fewshot scenarios.
arXiv Detail & Related papers (2023-08-16T10:02:36Z) - Augmenting Diffs With Runtime Information [53.22981451758425]
Collector-Sahab is a tool that augments code diffs with runtime difference information.
We run Collector-Sahab on 584 code diffs for Defects4J bugs and find it successfully augments the code diff for 95% (555/584) of them.
arXiv Detail & Related papers (2022-12-20T16:33:51Z) - Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations [44.99833362998488]
Changes in the data distribution at test time can have deleterious effects on the performance of predictive models.
We propose a test-time label shift correction that adapts to changes in the joint distribution $p(y, z)$ using EM applied to unlabeled samples.
arXiv Detail & Related papers (2022-11-28T18:52:33Z) - LMdiff: A Visual Diff Tool to Compare Language Models [25.229215469012637]
LMdiff is a tool that visually compares probability distributions of two models that differ.
We showcase the applicability of LMdiff for hypothesis generation across multiple case studies.
arXiv Detail & Related papers (2021-11-02T13:17:20Z) - Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information [67.25713071340518]
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans. We frame dataset difficulty as the lack of $\mathcal{V}$-\textit{usable information}. We also introduce \textit{pointwise $\mathcal{V}$-information} (PVI) for measuring the difficulty of individual instances.
arXiv Detail & Related papers (2021-10-16T00:21:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.