Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair
- URL: http://arxiv.org/abs/2511.11012v1
- Date: Fri, 14 Nov 2025 07:00:47 GMT
- Authors: Noor Nashid, Daniel Ding, Keheliya Gallaba, Ahmed E. Hassan, Ali Mesbah
- Abstract summary: Repairing multi-hunk bugs requires coordinated edits across multiple, disjoint code regions. We evaluate coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on 372 multi-hunk bugs from the Hunk4J dataset.
- Score: 6.60715519922201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, posing substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these agents on 372 multi-hunk bugs from the Hunk4J dataset, analyzing 1,488 repair trajectories using fine-grained metrics that capture localization, repair accuracy, regression behavior, and operational dynamics. Results reveal substantial variation: repair accuracy ranges from 25.8% (Qwen Code) to 93.3% (Claude Code) and consistently declines with increasing bug dispersion and complexity. High-performing agents demonstrate superior semantic consistency, achieving positive regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast; failed repairs consume substantially more resources (39%-343% more tokens) and require longer execution time (43%-427%). Additionally, we developed Maple to provide agents with repository-level context. Empirical results show that Maple improves the repair accuracy of Gemini-cli by 30% through enhanced localization. By analyzing fine-grained metrics and trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair.
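The headline numbers in the abstract (repair accuracy per agent, and the extra tokens consumed by failed repairs) can be illustrated with a minimal sketch. The `Trajectory` record, its field names, and the mean-based overhead formula below are assumptions made for illustration; the abstract does not state the paper's exact metric definitions, and the demo values are invented, not taken from the study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trajectory:
    """Hypothetical record for one agent run on one bug (not the paper's schema)."""
    agent: str
    bug_id: str
    repaired: bool
    tokens: int


def repair_accuracy(trajs):
    """Percentage of trajectories that ended in a successful repair."""
    return 100.0 * sum(t.repaired for t in trajs) / len(trajs)


def failed_token_overhead(trajs):
    """Percent extra tokens used by failed runs relative to successful runs.

    Assumed definition: difference of mean token counts, divided by the
    mean token count of successful runs.
    """
    ok = [t.tokens for t in trajs if t.repaired]
    bad = [t.tokens for t in trajs if not t.repaired]
    return 100.0 * (mean(bad) - mean(ok)) / mean(ok)


# Consistency check on the abstract's counts: 372 bugs x 4 agents.
assert 372 * 4 == 1488

# Invented demo data for a single hypothetical agent.
demo = [
    Trajectory("agent-a", "bug-1", True, 1000),
    Trajectory("agent-a", "bug-2", True, 3000),
    Trajectory("agent-a", "bug-3", False, 5000),
    Trajectory("agent-a", "bug-4", False, 3000),
]

print(repair_accuracy(demo))        # 50.0
print(failed_token_overhead(demo))  # 100.0
```

Under this assumed definition, an overhead of 100% means failed runs used twice as many tokens on average as successful ones, which is the sense in which agents "do not fail fast".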
Related papers
- SGAgent: Suggestion-Guided LLM-Based Multi-Agent Framework for Repository-Level Software Repair [22.745971570878435]
  We propose a Suggestion-Guided multi-Agent framework for repository-level software repair. SGAgent introduces a suggestion phase to strengthen the transition from localization to repair. Three specialized sub-agents collaborate to achieve automated end-to-end software repair.
  (2026-02-27T03:32:47Z)
- TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code [11.207330722400764]
  We present TraceCoder, a framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces. It then conducts causal analysis on these traces to accurately identify the root cause of the failure.
  (2026-02-06T16:59:48Z)
- Beyond Bug Fixes: An Empirical Investigation of Post-Merge Code Quality Issues in Agent-Generated Pull Requests [4.744786007044749]
  We analyze 1,210 merged agent-generated bug-fix PRs from Python repositories in the AIDev dataset. Our results show that apparent differences in raw issue counts across agents largely disappear after normalizing by code churn. Across all agents, code smells dominate, particularly at critical and major severities, while bugs are less frequent but often severe.
  (2026-01-27T22:55:05Z)
- DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
  DoVer is an intervention-driven debug framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
  (2025-12-07T09:23:48Z)
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
  High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs.
  (2025-10-22T17:58:56Z)
- Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
  Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
  (2025-10-16T05:35:37Z)
- Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
  We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
  (2025-10-13T22:22:28Z)
- Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
  Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
  (2025-09-29T18:20:27Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
  We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
  (2025-09-25T14:05:55Z)
- An Empirical Study on Failures in Automated Issue Solving [12.571536148821144]
  We analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified. To move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks.
  (2025-09-17T13:07:52Z)
- Automated Repair of C Programs Using Large Language Models [0.0]
  This study explores the potential of Large Language Models (LLMs) in automating the repair of C programs. We present a framework that integrates spectrum-based fault localization (SBFL), runtime feedback, and Chain-of-Thought-structured prompting into an autonomous repair loop. Our approach achieves 44.93% repair accuracy, representing a 3.61% absolute improvement over strong state-of-the-art APR baselines.
  (2025-09-02T04:34:11Z)
- Boosting Redundancy-based Automated Program Repair by Fine-grained Pattern Mining [18.7107522872479]
  We propose a new repair technique named Repatt, which incorporates a two-level pattern mining process for guiding effective patch generation. We have conducted an experiment on the widely-used Defects4J benchmark and compared Repatt with ten state-of-the-art APR approaches.
  (2023-12-26T08:42:32Z)
- Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
  We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub. We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
  (2021-04-16T05:27:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.