Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair
- URL: http://arxiv.org/abs/2510.03217v1
- Date: Fri, 03 Oct 2025 17:53:28 GMT
- Title: Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair
- Authors: José Cambronero, Michele Tufano, Sherry Shi, Renyao Wei, Grant Uy, Runxiang Cheng, Chin-Jung Liu, Shiying Pan, Satish Chandra, Pat Rondon
- Abstract summary: Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry. Showing unlikely patches to developers can lead to substantial noise. We introduce two complementary policies to reduce such noise: bug abstention and patch validation policies.
- Score: 7.118712516789191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry, but ultimately agent-generated patches still need to be reviewed by a human before committing them to ensure they address the bug. Showing unlikely patches to developers can lead to substantial noise, wasting valuable developer time and eroding trust in automated code changes. We introduce two complementary LLM-based policies to reduce such noise: bug abstention and patch validation policies. Bug abstention excludes bugs that the agentic APR system is unlikely to fix. Patch validation rejects patches that are unlikely to be a good fix for the given bug. We evaluate both policies on three sets of bugs from Google's codebase, and their candidate patches generated by an internal agentic APR system. On a set of 174 human-reported bugs, removing bugs and patch trajectories rejected by our policies can raise success rates by up to 13 percentage points and 15 percentage points, respectively, and by up to 39 percentage points in combination. On null pointer exceptions and sanitizer-reported bugs with machine-generated bug reports, patch validation also improves average single-sample success rates. This two-policy approach provides a practical path to the reliable, industrial-scale deployment of agentic APR systems.
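The two-policy design described in the abstract can be pictured as a pair of LLM-based gates around the repair agent: one gate decides whether to attempt a bug at all, the other decides whether a generated patch should ever reach a human. The sketch below is a hypothetical illustration of that control flow only; the function names, thresholds, and score interface are assumptions, not the authors' implementation.

```python
# Hedged sketch of a dual-policy (abstain + validate) wrapper around an
# agentic APR system. All names and thresholds here are illustrative
# assumptions; the paper does not publish this interface.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Bug:
    report: str  # bug report text shown to the policies


@dataclass
class Patch:
    diff: str  # candidate patch produced by the repair agent


# Each LLM policy is modeled as a callable that scores its input in [0, 1].
Policy = Callable[[str], float]


def repair_with_policies(
    bug: Bug,
    generate_patch: Callable[[Bug], Patch],
    abstain_policy: Policy,
    validate_policy: Policy,
    abstain_threshold: float = 0.5,
    validate_threshold: float = 0.5,
) -> Optional[Patch]:
    """Return a patch only if both policies pass; otherwise stay silent."""
    # Bug abstention: skip bugs the APR agent is unlikely to fix,
    # so no patch attempt is ever surfaced for them.
    if abstain_policy(bug.report) < abstain_threshold:
        return None
    patch = generate_patch(bug)
    # Patch validation: reject patches unlikely to be a good fix,
    # filtering them out before they reach human review.
    if validate_policy(bug.report + "\n" + patch.diff) < validate_threshold:
        return None
    return patch
```

With stub policies in place of real LLM judges, a patch is surfaced only when both gates score above threshold; either gate alone suffices to suppress it, which is how the combined policies can raise the reviewed-patch success rate above either policy in isolation.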
Related papers
- ComPass: Contrastive Learning for Automated Patch Correctness Assessment in Program Repair [20.606877071567958]
We present ComPass, a pre-trained language model (PLM)-based automated patch correctness assessment approach. We show that ComPass achieves an accuracy of 88.35%, significantly outperforming the state-of-the-art baseline APPT.
arXiv Detail & Related papers (2026-02-07T14:17:21Z) - Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z) - BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High-quality bugs are key to training the next generation of language model-based software engineering (SWE) agents. We introduce a novel method for the synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z) - REFINE: Enhancing Program Repair Agents through Context-Aware Patch Refinement [12.995571513415905]
Large Language Models (LLMs) have recently shown strong potential in automatic program repair (APR). LLMs often struggle to produce correct fixes due to limited understanding of code context and over-reliance on incomplete test suites. We propose a novel patch refinement framework, Refine, that systematically transforms draft patches into correct ones.
arXiv Detail & Related papers (2025-10-04T00:34:32Z) - Red Teaming Program Repair Agents: When Correct Patches can Hide Vulnerabilities [22.02073334787359]
We propose SWExploit, which generates adversarial issue statements to make APR agents produce patches that are functionally correct yet vulnerable. Based on our evaluation, we are the first to challenge the traditional assumption that a patch passing all tests is inherently reliable and secure.
arXiv Detail & Related papers (2025-09-30T07:38:57Z) - Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z) - What Do They Fix? LLM-Aided Categorization of Security Patches for Critical Memory Bugs [46.325755802511026]
We develop LM, a dual-method pipeline that integrates two approaches based on a Large Language Model (LLM) and a fine-tuned small language model. LM successfully identified 111 of 5,140 recent Linux kernel patches addressing OOB or UAF vulnerabilities, with 90 true positives confirmed by manual verification.
arXiv Detail & Related papers (2025-09-26T18:06:36Z) - VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each stage aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z) - Adversarial Bug Reports as a Security Risk in Language Model-Based Automated Program Repair [1.1677624591989955]
Automated Program Repair (APR) systems are increasingly integrated into modern software development. In this paper, we investigate the security risks posed by adversarial bug reports. We develop a comprehensive threat model and conduct an empirical study to evaluate the vulnerability of state-of-the-art APR systems to such attacks.
arXiv Detail & Related papers (2025-09-04T09:41:57Z) - Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study [18.117047833029073]
The most popular benchmarks for automated issue solving are SWE-bench and its human-filtered subset SWE-bench Verified. This paper presents an in-depth empirical study of the correctness of plausible patches generated by three state-of-the-art issue-solving tools evaluated on SWE-bench Verified.
arXiv Detail & Related papers (2025-03-19T14:02:21Z) - Evaluating Agent-based Program Repair at Google [9.62742759337993]
Agent-based program repair promises to automatically resolve complex bugs end-to-end. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench. This paper explores the viability of using an agentic approach to address bugs in an enterprise context.
arXiv Detail & Related papers (2025-01-13T18:09:25Z) - SoftPatch+: Fully Unsupervised Anomaly Classification and Segmentation [84.07909405887696]
This paper is the first to consider fully unsupervised industrial anomaly detection (i.e., unsupervised AD with noisy data). We propose memory-based unsupervised AD methods, SoftPatch and SoftPatch+, which efficiently denoise the data at the patch level. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in the coreset. Comprehensive experiments conducted in diverse noise scenarios demonstrate that both SoftPatch and SoftPatch+ outperform state-of-the-art AD methods on the MVTecAD, ViSA, and BTAD benchmarks.
arXiv Detail & Related papers (2024-12-30T11:16:49Z) - RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen). RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.