Related papers: Agentic Reinforcement Learning for Real-World Code Repair

URL: http://arxiv.org/abs/2510.22075v1
Date: Fri, 24 Oct 2025 23:25:02 GMT
Title: Agentic Reinforcement Learning for Real-World Code Repair
Authors: Siyu Zhu, Anastasiya Karpovich, Albert Chen, Jessica Koscheka, Shailesh Jannu, Di Wen, Yuqing Zhu, Rohit Jain, Alborz Geramifard
Abstract summary: We tackle the challenge of training reliable code-fixing agents in real repositories. We developed a verifiable pipeline with success defined as post-fix build validation. We introduced a scalable simplified pipeline for large-scale reinforcement learning.
Score: 7.512134741776294
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We tackle the challenge of training reliable code-fixing agents in real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline with success defined as post-fix build validation and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we supervised fine-tuned Qwen3-32B in the full pipeline and applied RL on top of the SFT model in the simplified environment. The SFT model distilled from GPT-4.1 trajectories performs on par while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. "Thinking mode" was on par or worse in our experiments. Both SFT and RL models failed to generalize across environments, highlighting the importance of matching train-test environments for building reliable real-world code-fixing agents.

Related papers

Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning [7.006180736433431]
Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. We propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation.
arXiv Detail & Related papers (2026-02-10T17:28:12Z)
CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities. We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z)
RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments [111.87296453908199]
We introduce Reinforcement Learning with Adaptive Verifiable Environments (RLVE). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. We show that environment scaling, i.e., expanding the collection of training environments, consistently improves reasoning capabilities.
arXiv Detail & Related papers (2025-11-10T17:18:35Z)
Scaling Agent Learning via Experience Synthesis [100.42712232390532]
Reinforcement learning can empower autonomous agents by enabling self-improvement through interaction. But its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity. We introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind.
arXiv Detail & Related papers (2025-11-05T18:58:48Z)
Simulating Environments with Reasoning Models for Agent Training [55.98861707136674]
Building bespoke environments for training is heavy, brittle, and limits progress. We propose two frameworks: Simia-SFT and Simia-RL. Simia-SFT and Simia-RL enable scalable agent training without environment engineering.
arXiv Detail & Related papers (2025-11-03T18:29:57Z)
Don't Just Fine-tune the Agent, Tune the Environment [25.7349297100143]
Supervised fine-tuning on synthetic data leads to overfitting. Standard reinforcement learning struggles with a critical cold-start problem and training instability. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration.
arXiv Detail & Related papers (2025-10-11T12:35:15Z)
Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning [6.534445405422796]
Pentest-R1 is a framework designed to optimize reasoning capabilities for penetration testing tasks. It learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models.
arXiv Detail & Related papers (2025-08-10T15:14:05Z)
Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
Enhancing Reinforcement Learning for the Floorplanning of Analog ICs with Beam Search [0.32985979395737786]
This paper presents a hybrid method that combines reinforcement learning (RL) with a beam search (BS) strategy. The BS algorithm enhances the agent's inference process, allowing for the generation of flexible floorplans. Experimental results show approx. 5-85% improvement in area, dead space and half-perimeter wire length compared to a standard RL application.
arXiv Detail & Related papers (2025-05-08T08:50:32Z)
Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models [12.656574142412484]
We make an attempt to understand the correlation between supervised fine-tuning and reinforcement learning. We find that both atomic and synthetic functions are indispensable for SFT's generalization.
arXiv Detail & Related papers (2024-06-14T03:39:01Z)
Single-Trajectory Distributionally Robust Reinforcement Learning [21.955807398493334]
We propose Distributionally Robust RL (DRRL) to enhance performance across a range of environments. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. We design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ).
arXiv Detail & Related papers (2023-01-27T14:08:09Z)
Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment. We propose a new method to solve it, using unsupervised model-based RL, for pre-training the agent. We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.