Related papers: Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

URL: http://arxiv.org/abs/2511.11012v1
Date: Fri, 14 Nov 2025 07:00:47 GMT
Title: Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair
Authors: Noor Nashid, Daniel Ding, Keheliya Gallaba, Ahmed E. Hassan, Ali Mesbah,
Abstract summary: Repairing multi-hunk bugs requires coordinated edits across multiple, disjoint code regions.<n>We evaluate coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on 372 multi-hunk bugs from the Hunk4J dataset.
Score: 6.60715519922201
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, posing substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these agents on 372 multi-hunk bugs from the Hunk4J dataset, analyzing 1,488 repair trajectories using fine-grained metrics that capture localization, repair accuracy, regression behavior, and operational dynamics. Results reveal substantial variation: repair accuracy ranges from 25.8% (Qwen Code) to 93.3% (Claude Code) and consistently declines with increasing bug dispersion and complexity. High-performing agents demonstrate superior semantic consistency, achieving positive regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast; failed repairs consume substantially more resources (39%-343% more tokens) and require longer execution time (43%-427%). Additionally, we developed Maple to provide agents with repository-level context. Empirical results show that Maple improves the repair accuracy of Gemini-cli by 30% through enhanced localization. By analyzing fine-grained metrics and trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair.

Related papers

SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair [22.745971570878435]
We propose a Suggestion-Guided multi-Agent framework for repository-level software repair.<n> SGAgent introduces a suggestion phase to strengthen the transition from localization to repair.<n>Three specialized sub-agents collaborate to achieve automated end-to-end software repair.
arXiv Detail & Related papers (2026-02-27T03:32:47Z)
TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code [11.207330722400764]
We present TraceCoder, a framework that emulates the observe-analyze-repair process of human experts.<n>The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces.<n>It then conducts causal analysis on these traces to accurately identify the root cause of the failure.
arXiv Detail & Related papers (2026-02-06T16:59:48Z)
Beyond Bug Fixes: An Empirical Investigation of Post-Merge Code Quality Issues in Agent-Generated Pull Requests [4.744786007044749]
We analyze 1,210 merged agent-generated bug-fix PRs from Python repositories in the AIDev dataset.<n>Our results show that apparent differences in raw issue counts across agents largely disappear after normalizing by code churn.<n>Across all agents, code smells dominate, particularly at critical and major severities, while bugs are less frequent but often severe.
arXiv Detail & Related papers (2026-01-27T22:55:05Z)
DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debug framework for large language model (LLM)-based multi-agent systems.<n>It augments hypothesis generation with active verification through targeted interventions.<n>DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High quality bugs are key to training the next generation of language model based software engineering (SWE) agents.<n>We introduce a novel method for synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z)
Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors.<n>We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks.<n>We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks.<n>Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks.<n>They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions.<n>Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way.<n>We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
An Empirical Study on Failures in Automated Issue Solving [12.571536148821144]
We analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified.<n>To move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances.<n>The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks.
arXiv Detail & Related papers (2025-09-17T13:07:52Z)
Automated Repair of C Programs Using Large Language Models [0.0]
This study explores the potential of Large Language Models (LLMs) in automating the repair of C programs.<n>We present a framework that integrates spectrum-based fault localization (SBFL), runtime feedback, and Chain-of-Thought-structured prompting into an autonomous repair loop.<n>Our approach achieves 44.93% repair accuracy, representing a 3.61% absolute improvement over strong state-of-the-art APR baselines.
arXiv Detail & Related papers (2025-09-02T04:34:11Z)
Boosting Redundancy-based Automated Program Repair by Fine-grained Pattern Mining [18.7107522872479]
We propose a new repair technique named Repatt, which incorporates a two-level pattern mining process for guiding effective patch generation.<n>We have conducted an experiment on the widely-used Defects4J benchmark and compared Repatt with ten state-of-the-art APR approaches.
arXiv Detail & Related papers (2023-12-26T08:42:32Z)
Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub. We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
arXiv Detail & Related papers (2021-04-16T05:27:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.