Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback
- URL: http://arxiv.org/abs/2507.18755v1
- Date: Thu, 24 Jul 2025 19:12:32 GMT
- Title: Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback
- Authors: Chandra Maddila, Adam Tait, Claire Chang, Daniel Cheng, Nauman Ahmad, Vijayaraghavan Murali, Marshall Roch, Arnaud Avondet, Aaron Meltzer, Victor Montalvao, Michael Hopko, Chris Waterson, Parth Thakkar, Renuka Fernandez, Kristian Kristensen, Sivan Barzily, Sherry Chen, Rui Abreu, Nachiappan Nagappan, Payam Shodjai, Killian Murphy, James Everingham, Aparna Ramani, Peter C. Rigby
- Abstract summary: We develop an Engineering Agent that fixes the source code based on test failures at scale across diverse software offerings. We provide feedback to the agent through static analysis and test failures so it can refine its solution. In a three-month period, 80% of the generated fixes were reviewed, of which 31.5% were landed.
- Score: 11.070932612938154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aim: With the advent of LLMs, sophisticated agentic program repair has become viable at large organizations with large codebases. In this work, we develop an Engineering Agent that fixes source code based on test failures at scale across diverse software offerings internally. Method: Using Llama as the base, we employ the ReAct harness to develop an agent. We start with a test failure that was triaged by a rule-based test failure bot. We then set up an agentic harness and allow the agent to reason and run a set of 15 actions, from reading a file to generating a patch. We provide feedback to the agent through static analysis and test failures so it can refine its solution. We leverage an LLM-as-a-Judge to ensure that the patch conforms to the standards, followed by a human review to land fixes. Benchmark Findings: We curated offline benchmarks for our patch generator, the Engineering Agent loop, and the LLM-as-a-Judge. In offline evaluations we found that a specialized 70B model is highly competitive with the much larger but vanilla Llama-405B. In an ablation study, we found that the ReAct harness (neural model) benefited from the symbolic information from static analysis tools and test execution traces. A model that strikes a balance between solve rate and error rate versus cost and latency has a benchmark solve rate of 42.3% using an average of 11.8 feedback iterations. Production Findings: In a three-month period, 80% of the generated fixes were reviewed, of which 31.5% were landed (25.5% of the total number of generated fixes). Feedback from Engineers: We used open coding to extract qualitative themes from engineers' feedback. We saw positive feedback in the form of quick approvals, gratitude, and surprise. We also found mixed feedback when the Engineering Agent's solution was only partially correct but served as a good starting point.
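The abstract describes a ReAct-style loop in which the agent proposes a patch and refines it using static analysis and test execution feedback over roughly a dozen iterations. The Python sketch below illustrates the general shape of such a loop; the function names (propose_patch, run_tests, run_static_analysis), the Feedback structure, and the iteration budget are illustrative assumptions, not the paper's internal implementation.

```python
# Illustrative sketch of a ReAct-style repair loop driven by test and
# static-analysis feedback. All names and the loop structure are assumptions
# for illustration; the paper's harness, action set, and models are internal.
from dataclasses import dataclass

MAX_ITERATIONS = 12  # the paper reports ~11.8 feedback iterations on average


@dataclass
class Feedback:
    test_failures: list[str]
    lint_findings: list[str]

    @property
    def clean(self) -> bool:
        return not self.test_failures and not self.lint_findings


def run_tests(patch: str) -> list[str]:
    """Placeholder: execute the failing test suite against the patched code."""
    return []


def run_static_analysis(patch: str) -> list[str]:
    """Placeholder: run linters/analyzers on the patched code."""
    return []


def propose_patch(failure_report: str, feedback: Feedback | None) -> str:
    """Placeholder: one ReAct step -- the LLM reasons over the failure report
    and any prior feedback, then emits a candidate patch."""
    return "<candidate patch>"


def repair_loop(failure_report: str) -> str | None:
    feedback: Feedback | None = None
    for _ in range(MAX_ITERATIONS):
        patch = propose_patch(failure_report, feedback)
        feedback = Feedback(
            test_failures=run_tests(patch),
            lint_findings=run_static_analysis(patch),
        )
        if feedback.clean:
            return patch  # hand off to LLM-as-a-Judge, then human review
    return None  # iteration budget exhausted without a clean patch
```

In the paper's pipeline, a patch that survives this kind of loop is then screened by an LLM-as-a-Judge before a human review lands the fix.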
Related papers
- RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment [0.0]
We introduce RePaCA, a novel static APCA technique that leverages Large Language Models (LLMs) specialized in thinking tasks. Our approach achieves state-of-the-art performance, with 83.1% accuracy and an 84.8% F1-score.
arXiv Detail & Related papers (2025-07-30T11:21:09Z) - StaAgent: An Agentic Framework for Testing Static Analyzers [7.951459111292028]
StaAgent is an agentic framework that harnesses the generative capabilities of Large Language Models (LLMs) to systematically evaluate static analyzer rules. StaAgent helps uncover flaws in rule implementations by revealing inconsistent behaviors. We evaluated StaAgent with five state-of-the-art LLMs across five widely used static analyzers.
arXiv Detail & Related papers (2025-07-20T13:41:02Z) - May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs [13.976286931563006]
A simple yet effective way to find bugs in Deep Learning (DL) frameworks is fuzz testing (fuzzing). We propose FUEL to break the seal of feedback-driven fuzzing for DL frameworks. FUEL has detected 104 bugs for PyTorch and TensorFlow, with 93 confirmed as new bugs, 47 already fixed, and 5 assigned CVE IDs.
arXiv Detail & Related papers (2025-06-21T08:51:53Z) - Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories [59.214178488091584]
We propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents.
arXiv Detail & Related papers (2025-04-11T19:49:22Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) is becoming widespread, yet its task success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [22.03052751722933]
Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors, such as ModuleNotFoundError and TypeError, and highlighted particularly challenging errors like OSError and database-related issues.
arXiv Detail & Related papers (2025-03-16T06:24:51Z) - Large Language Model Critics for Execution-Free Evaluation of Code Changes [5.1973075342632535]
Large language models (LLMs) offer a promising way to automate software engineering tasks. Existing metrics for evaluating such changes, mainly build status and occasionally log analysis, are too sparse and limited in providing the information needed to assess the quality of the changes made. In this work, we designed LLM-based critics to derive well-structured and rigorous intermediate/step-level, execution-free evaluation proxies for the executability of code changes.
arXiv Detail & Related papers (2025-01-28T02:38:56Z) - A Unified Debugging Approach via LLM-Based Multi-Agent Synergy [39.11825182386288]
FixAgent is an end-to-end framework for unified debugging through multi-agent synergy.
It significantly outperforms state-of-the-art repair methods, fixing 1.25× to 2.56× as many bugs on the repo-level benchmark Defects4J.
arXiv Detail & Related papers (2024-04-26T04:55:35Z) - MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z) - Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories [88.08381083207449]
We show the prevalence of generalization failure on controllable states from stranger agents.
We propose a novel method called Self-Trajectory Augmentation (STA), which resets the environment to the agent's old states according to the Q function during training.
arXiv Detail & Related papers (2023-04-26T10:12:12Z)