Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent
- URL: http://arxiv.org/abs/2512.14990v1
- Date: Wed, 17 Dec 2025 00:50:58 GMT
- Title: Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent
- Authors: Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh
- Abstract summary: RepGen is a novel, automated, and intelligent approach for reproducing deep learning bugs. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%.
- Score: 6.992405861720876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and uses an LLM to generate code that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants' cognitive load.
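The iterative generate-validate-refine mechanism named in the abstract can be pictured with a minimal sketch. This is not RepGen's implementation: the function `reproduce`, its parameters, and the convention that an empty feedback string means success are all assumptions made for illustration. In a real system, `generate` would be an LLM call and `validate` would execute the candidate code.

```python
from typing import Callable, Optional


def reproduce(
    generate: Callable[[str, str], str],
    validate: Callable[[str], str],
    plan: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate that validates, or None if the budget runs out.

    generate(plan, feedback) proposes reproduction code (hypothetical LLM call).
    validate(code) returns "" on success, else a message describing the failure.
    """
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(plan, feedback)  # generate, or refine using feedback
        feedback = validate(candidate)        # check whether the bug shows up
        if not feedback:
            return candidate
    return None


# Stub callbacks standing in for the LLM and the execution sandbox.
def fake_generate(plan: str, feedback: str) -> str:
    return "raise ValueError" if feedback else "pass"


def fake_validate(code: str) -> str:
    return "" if "ValueError" in code else "expected ValueError was not raised"


print(reproduce(fake_generate, fake_validate, plan="trigger ValueError"))
```

With these stubs, the first candidate fails validation, the failure message is fed back, and the second candidate is accepted. Bounding the loop with `max_rounds` keeps a stubborn bug from consuming unlimited model calls.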
Related papers
- Test-time Recursive Thinking: Self-Improvement without External Feedback [120.80790108733942]
Test-time Recursive Thinking (TRT) is an iterative self-improvement framework.
Open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench's most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.
arXiv Detail & Related papers (2026-02-03T04:37:37Z) - Toward Training Superintelligent Software Agents through Self-Play SWE-RL [66.11447353341926]
Self-play SWE-RL is a first step toward training paradigms for superintelligent software agents.
Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies.
Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories.
arXiv Detail & Related papers (2025-12-21T00:49:40Z) - RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [75.52891348667491]
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics.
The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response.
We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification.
arXiv Detail & Related papers (2025-11-03T17:15:05Z) - BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High-quality bugs are key to training the next generation of language-model-based software engineering (SWE) agents.
We introduce a novel method for synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z) - BugScope: Learn to Find Bugs Like Human [9.05553442116139]
BugScope emulates how human auditors learn new bug patterns from representative examples and apply that knowledge during code auditing.
Our evaluation on a dataset of 40 real-world bugs drawn from 21 widely-used open-source projects demonstrates that BugScope achieves 87.04% precision.
Further testing on large-scale open-source systems, including the Linux kernel, uncovered 141 previously unknown bugs.
arXiv Detail & Related papers (2025-07-21T14:34:01Z) - BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis [1.9291502706655312]
We introduce BugGen, a first-of-its-kind, fully autonomous, multi-agent pipeline to generate, insert, and validate functional bugs in RTL.
BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms.
Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour, over five times faster than typical manual expert insertion.
arXiv Detail & Related papers (2025-06-12T09:02:20Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [62.12404317786005]
EvoCoder is a continuous learning framework for issue code reproduction.
Our results show a 20% improvement in issue reproduction rates over existing SOTA methods.
arXiv Detail & Related papers (2024-11-21T08:49:23Z) - Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study [13.17302533571231]
This paper examines the critical issue of reproducing deep learning bugs.
We identify edit actions and useful information that could improve the reproducibility of deep learning bugs.
We successfully reproduced 148 out of 165 bugs attempted.
arXiv Detail & Related papers (2024-01-05T21:30:13Z) - Prompting Is All You Need: Automated Android Bug Replay with Large Language Models [28.69675481931385]
We propose AdbGPT, a new lightweight approach to automatically reproduce the bugs from bug reports through prompt engineering.
AdbGPT leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs.
Our evaluations demonstrate the effectiveness and efficiency of AdbGPT, which reproduces 81.3% of bug reports in 253.6 seconds.
arXiv Detail & Related papers (2023-06-03T03:03:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.