Agentic Reinforcement Learning for Real-World Code Repair
- URL: http://arxiv.org/abs/2510.22075v1
- Date: Fri, 24 Oct 2025 23:25:02 GMT
- Title: Agentic Reinforcement Learning for Real-World Code Repair
- Authors: Siyu Zhu, Anastasiya Karpovich, Albert Chen, Jessica Koscheka, Shailesh Jannu, Di Wen, Yuqing Zhu, Rohit Jain, Alborz Geramifard,
- Abstract summary: We tackle the challenge of training reliable code-fixing agents in real repositories.<n>We developed a verifiable pipeline with success defined as post-fix build validation.<n>We introduced a scalable simplified pipeline for large-scale reinforcement learning.
- Score: 7.512134741776294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the challenge of training reliable code-fixing agents in real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline with success defined as post-fix build validation and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we supervised fine-tuned Qwen3-32B in the full pipeline and applied RL on top of the SFT model in the simplified environment. The SFT model distilled from GPT-4.1 trajectories performs on par while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. "Thinking mode" was on par or worse in our experiments. Both SFT and RL models failed to generalize across environments, highlighting the importance of matching train-test environments for building reliable real-world code-fixing agents.
Related papers
- Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning [7.006180736433431]
Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry.<n>We propose a novel textbfAnswer-First, Reason Later (AFRL) paradigm.<n>This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation.
arXiv Detail & Related papers (2026-02-10T17:28:12Z) - CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.<n>Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.<n>We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z) - Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities.<n>We introduce TRAPO, a framework that interleavesSupervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance.<n>Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z) - RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments [111.87296453908199]
We introduce Reinforcement Learning with Adaptive Verifiable Environments (RLVE)<n>RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses.<n>We show that environment scaling, i.e., expanding the collection of training environments, consistently improves reasoning capabilities.
arXiv Detail & Related papers (2025-11-10T17:18:35Z) - Scaling Agent Learning via Experience Synthesis [100.42712232390532]
Reinforcement learning can empower autonomous agents by enabling self-improvement through interaction.<n>But its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity.<n>We introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind.
arXiv Detail & Related papers (2025-11-05T18:58:48Z) - Simulating Environments with Reasoning Models for Agent Training [55.98861707136674]
Building bespoke environments for training is heavy, brittle, and limits progress.<n>We propose two frameworks: Simia-SFT and Simia-RL.<n>Simia-SFT and Simia-RL enable scalable agent training without environment engineering.
arXiv Detail & Related papers (2025-11-03T18:29:57Z) - Don't Just Fine-tune the Agent, Tune the Environment [25.7349297100143]
Supervised fine-tuning on synthetic data leads to overfitting.<n>Standard reinforcement learning struggles with a critical cold-start problem and training instability.<n>Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration.
arXiv Detail & Related papers (2025-10-11T12:35:15Z) - Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning [6.534445405422796]
Pentest-R1 is a framework designed to optimize reasoning capabilities for penetration testing tasks.<n>It learns directly from environmental feedback to develop robust error self-correction and adaptive strategies.<n>On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models.
arXiv Detail & Related papers (2025-08-10T15:14:05Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks.<n>We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions.<n>We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models.<n>We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z) - Enhancing Reinforcement Learning for the Floorplanning of Analog ICs with Beam Search [0.32985979395737786]
This paper presents a hybrid method that combines reinforcement learning (RL) with a beam (BS) strategy.<n>The BS algorithm enhances the agent's inference process, allowing for the generation of flexible floorplans.<n> Experimental results show approx. 5-85% improvement in area, dead space and half-perimeter wire length compared to a standard RL application.
arXiv Detail & Related papers (2025-05-08T08:50:32Z) - Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models [12.656574142412484]
We make an attempt to understand the correlation between supervised fine-tuning and reinforcement learning.<n>We find that both atomic and synthetic functions are indispensable for SFT's generalization.
arXiv Detail & Related papers (2024-06-14T03:39:01Z) - Single-Trajectory Distributionally Robust Reinforcement Learning [21.955807398493334]
We propose Distributionally Robust RL (DRRL) to enhance performance across a range of environments.
Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory.
We design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ)
arXiv Detail & Related papers (2023-01-27T14:08:09Z) - Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment.
We propose a new method to solve it, using unsupervised model-based RL, for pre-training the agent.
We show robust performance on the Real-Word RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.