Agentic Bug Reproduction for Effective Automated Program Repair at Google
- URL: http://arxiv.org/abs/2502.01821v1
- Date: Mon, 03 Feb 2025 20:57:17 GMT
- Title: Agentic Bug Reproduction for Effective Automated Program Repair at Google
- Authors: Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, Satish Chandra
- Abstract summary: This paper investigates automated BRT generation within an industry setting, specifically at Google.
We adapt and evaluate a state-of-the-art BRT generation technique, LIBRO, and present our agent-based approach, BRT Agent.
Our results show that providing BRTs to the APR system results in 30% more bugs with plausible fixes.
- Score: 9.64193881099048
- Abstract: Bug reports often lack sufficient detail for developers to reproduce and fix the underlying defects. Bug Reproduction Tests (BRTs), tests that fail when the bug is present and pass when it has been resolved, are crucial for debugging, but they are rarely included in bug reports, both in open-source and in industrial settings. Thus, automatically generating BRTs from bug reports has the potential to accelerate the debugging process and lower the time to repair. This paper investigates automated BRT generation within an industry setting, specifically at Google, focusing on the challenges of a large-scale, proprietary codebase and considering real-world industry bugs extracted from Google's internal issue tracker. We adapt and evaluate a state-of-the-art BRT generation technique, LIBRO, and present our agent-based approach, BRT Agent, which makes use of a fine-tuned Large Language Model (LLM) for code editing. Our BRT Agent significantly outperforms LIBRO, achieving a 28% plausible BRT generation rate, compared to 10% by LIBRO, on 80 human-reported bugs from Google's internal issue tracker. We further investigate the practical value of generated BRTs by integrating them with an Automated Program Repair (APR) system at Google. Our results show that providing BRTs to the APR system results in 30% more bugs with plausible fixes. Additionally, we introduce Ensemble Pass Rate (EPR), a metric which leverages the generated BRTs to select the most promising fixes from all fixes generated by the APR system. Our evaluation of EPR for Top-K and threshold-based fix selection demonstrates promising results and trade-offs. For example, EPR correctly selects a plausible fix from a pool of 20 candidates in 70% of cases, based on its top-1 ranking.
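As a rough illustration of how a BRT-based selection metric in the spirit of EPR could be computed, here is a minimal Python sketch; the `run_test` oracle, all names, and the exact scoring rule are assumptions for illustration, not the paper's implementation.
```python
from typing import Callable, List, Optional

# A minimal sketch, not the paper's implementation. `run_test` is an assumed
# oracle that applies a candidate fix and reports whether a given BRT passes.
TestOracle = Callable[[str, str], bool]

def ensemble_pass_rate(fix: str, brts: List[str], run_test: TestOracle) -> float:
    """Fraction of generated BRTs that pass once `fix` is applied."""
    return sum(run_test(fix, t) for t in brts) / len(brts) if brts else 0.0

def select_fixes(fixes: List[str], brts: List[str], run_test: TestOracle,
                 top_k: int = 1, threshold: Optional[float] = None) -> List[str]:
    """Rank candidate fixes by pass rate (Top-K selection), optionally dropping
    fixes whose pass rate falls below a threshold."""
    ranked = sorted(fixes, key=lambda f: ensemble_pass_rate(f, brts, run_test),
                    reverse=True)
    if threshold is not None:
        ranked = [f for f in ranked
                  if ensemble_pass_rate(f, brts, run_test) >= threshold]
    return ranked[:top_k]

# Toy usage: only fix "b" passes both BRTs, so top-1 selection returns it.
oracle = lambda fix, brt: fix == "b"
print(select_fixes(["a", "b", "c"], ["brt_1", "brt_2"], oracle))  # ['b']
```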
Related papers
- Evaluating Agent-based Program Repair at Google [9.62742759337993]
Agent-based program repair promises to automatically resolve complex bugs end-to-end.
Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench benchmark.
This paper explores the viability of using an agentic approach to address bugs in an enterprise context.
arXiv Detail & Related papers (2025-01-13T18:09:25Z)
- ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs [23.419180504723546]
ContrastRepair is a novel approach that augments conversation-driven APR by providing contrastive test pairs.
We evaluate ContrastRepair on multiple benchmark datasets, including Defects4j, QuixBugs, and HumanEval-Java.
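A minimal sketch of the contrastive-pair idea follows, assuming a toy textual similarity for pairing and an invented prompt format; ContrastRepair's actual pairing strategy and prompts will differ.
```python
import difflib
from typing import List

def most_similar_passing_test(failing_test: str, passing_tests: List[str]) -> str:
    """Toy textual similarity used to pair tests; illustrative only."""
    return max(passing_tests,
               key=lambda t: difflib.SequenceMatcher(None, failing_test, t).ratio())

def contrastive_repair_prompt(buggy_code: str, failing_test: str,
                              passing_tests: List[str]) -> str:
    """Assemble one conversation turn pairing a failing test with a similar
    passing test, so the model can localize the behavioral difference."""
    passing = most_similar_passing_test(failing_test, passing_tests)
    return ("The following code is buggy:\n" + buggy_code +
            "\n\nThis test FAILS:\n" + failing_test +
            "\n\nThis similar test PASSES:\n" + passing +
            "\n\nExplain the difference, then propose a fixed version.")
```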
arXiv Detail & Related papers (2024-03-04T12:15:28Z)
- A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
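A minimal sketch of the round-trip idea, assuming a generic `llm` prompt-to-completion callable (hypothetical); the candidate still needs validation against the test suite.
```python
from typing import Callable

def round_trip_repair(buggy_code: str, llm: Callable[[str], str]) -> str:
    """Illustrative RTT: translate code to a natural-language specification and
    back, hoping the defect is 'translated away' in regeneration."""
    spec = llm("Describe precisely, in English, what this function should do:\n"
               + buggy_code)
    # Regenerate code from the description; validate before accepting.
    return llm("Write a correct implementation of this description:\n" + spec)
```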
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen).
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
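A minimal sketch of the retrieve-then-generate structure, with a toy lexical retriever standing in for RAP-Gen's actual retriever and an invented input format:
```python
from collections import Counter
from typing import List, Tuple

def lexical_similarity(a: str, b: str) -> float:
    """Toy token-overlap similarity; a stand-in for the real retriever."""
    ta, tb = Counter(a.split()), Counter(b.split())
    return sum((ta & tb).values()) / max(1, sum(ta.values()))

def retrieval_augmented_input(buggy: str,
                              bug_fix_pairs: List[Tuple[str, str]],
                              k: int = 2) -> str:
    """Prepend the k most similar past bug-fix pairs to the buggy input before
    handing it to a patch generator such as CodeT5."""
    top = sorted(bug_fix_pairs,
                 key=lambda p: lexical_similarity(buggy, p[0]),
                 reverse=True)[:k]
    context = "\n".join(f"BUG: {b}\nFIX: {f}" for b, f in top)
    return f"{context}\nBUG: {buggy}\nFIX:"
```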
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- Prompting Is All You Need: Automated Android Bug Replay with Large Language Models [28.69675481931385]
We propose AdbGPT, a new lightweight approach to automatically reproduce the bugs from bug reports through prompt engineering.
AdbGPT leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs.
Our evaluations demonstrate the effectiveness and efficiency of AdbGPT, which reproduces 81.3% of bug reports in 253.6 seconds.
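A minimal sketch of the prompting idea, with an invented one-shot example; AdbGPT's real prompts, entity format, and the device-replay step via adb are not shown here.
```python
# An invented one-shot example; AdbGPT's actual prompt templates differ.
FEW_SHOT_PROMPT = """Bug report: App crashes when tapping 'Save' with an empty title.
Steps: 1. Open the editor  2. Leave the title empty  3. Tap 'Save'

Bug report: {report}
Steps (reason step by step before answering):"""

def reproduction_steps_prompt(report: str) -> str:
    """Build a few-shot, chain-of-thought style prompt asking an LLM to extract
    replayable steps from a free-form bug report."""
    return FEW_SHOT_PROMPT.format(report=report)
```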
arXiv Detail & Related papers (2023-06-03T03:03:52Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
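A minimal sketch of such a self-debugging loop, assuming hypothetical `llm` and `run` callables; `run` returns an error message, or None when execution succeeds.
```python
from typing import Callable, Optional

def self_debug(task: str, llm: Callable[[str], str],
               run: Callable[[str], Optional[str]], max_rounds: int = 3) -> str:
    """Generate code, execute it, and feed the failure back to the model until
    it passes or the round budget is exhausted."""
    code = llm(f"Write a program for this task:\n{task}")
    for _ in range(max_rounds):
        error = run(code)
        if error is None:
            break  # the program behaved as expected
        code = llm(f"This program:\n{code}\nfailed with:\n{error}\n"
                   "Explain the bug, then output a corrected program.")
    return code
```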
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Improving Automated Program Repair with Domain Adaptation [0.0]
Automated Program Repair (APR) is defined as the process of fixing a bug or defect in source code by an automated tool.
APR tools have recently achieved promising results by leveraging state-of-the-art Natural Language Processing (NLP) techniques.
arXiv Detail & Related papers (2022-12-21T23:52:09Z)
- Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction [14.444294152595429]
In open-source repositories, the number of tests added in response to issues amounted to about 28% of the corresponding project's test suite size.
We propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks.
Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases.
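A minimal sketch of this generate-then-validate loop, assuming a hypothetical `fails_on_buggy` harness that compiles and runs a candidate test against the buggy version:
```python
from typing import Callable, List

def generate_brts(bug_report: str, llm: Callable[[str], str],
                  fails_on_buggy: Callable[[str], bool], n: int = 5) -> List[str]:
    """Sample candidate tests from a bug report and keep only those that
    actually reproduce the failure on the buggy version."""
    candidates = [llm(f"Write a unit test that reproduces this bug:\n{bug_report}")
                  for _ in range(n)]
    return [t for t in candidates if fails_on_buggy(t)]
```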
arXiv Detail & Related papers (2022-09-23T10:50:47Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- SUPERNOVA: Automating Test Selection and Defect Prevention in AAA Video Games Using Risk Based Testing and Machine Learning [62.997667081978825]
Testing video games is an increasingly difficult task as traditional methods fail to scale with growing software systems.
We present SUPERNOVA, a system responsible for test selection and defect prevention while also functioning as an automation hub.
The direct impact of this has been a reduction of 55% or more in testing hours for an undisclosed sports game title.
arXiv Detail & Related papers (2022-03-10T00:47:46Z)
- Break-It-Fix-It: Unsupervised Learning for Program Repair [90.55497679266442]
We propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas.
First, we use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data.
Second, we train a breaker to generate realistic bad code from good code; based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data.
BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python and 71.7% on DeepFix.
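A minimal sketch of one such iteration, with assumed `fixer`, `breaker`, and `critic` stand-ins exposing hypothetical apply/train interfaces rather than the paper's actual components:
```python
def bifi_round(bad_inputs, good_inputs, fixer, breaker, critic):
    """One illustrative BIFI iteration. `critic` checks validity (e.g., the
    code parses); `fixer` and `breaker` are assumed model stand-ins."""
    # 1. Run the fixer on real bad code; keep only outputs the critic accepts.
    verified = [(bad, fixer.apply(bad)) for bad in bad_inputs]
    verified = [(bad, good) for bad, good in verified if critic(good)]
    # 2. Retrain the breaker on the reverse (good -> bad) direction, then use
    #    it to synthesize realistic bad code from known-good code.
    breaker.train([(good, bad) for bad, good in verified])
    synthetic = [(breaker.apply(good), good) for good in good_inputs]
    # 3. Retrain the fixer on both the verified and the synthetic pairs.
    fixer.train(verified + synthetic)
```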
arXiv Detail & Related papers (2021-06-11T20:31:04Z)