RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
- URL: http://arxiv.org/abs/2511.07317v1
- Date: Mon, 10 Nov 2025 17:18:35 GMT
- Title: RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
- Authors: Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi
- Abstract summary: We introduce Reinforcement Learning with Adaptive Verifiable Environments (RLVE). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. We show that environment scaling, i.e., expanding the collection of training environments, consistently improves reasoning capabilities.
- Score: 111.87296453908199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
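To make the adaptation mechanism concrete, below is a minimal sketch of an adaptive verifiable environment in the spirit the abstract describes: problems are generated procedurally at an integer difficulty level, answers are checked algorithmically, and the level shifts whenever the policy's rolling pass rate leaves a target band. The class name, the toy arithmetic task, and all thresholds are illustrative assumptions, not the RLVE-Gym API.

```python
import random
from collections import deque


class AdaptiveVerifiableEnv:
    """Toy verifiable environment whose difficulty tracks policy ability.
    Hypothetical sketch; not the actual RLVE-Gym interface."""

    def __init__(self, min_level=1, max_level=10, window=128,
                 target_low=0.2, target_high=0.8):
        self.min_level, self.max_level = min_level, max_level
        self.level = min_level
        self.target_low, self.target_high = target_low, target_high
        self.recent = deque(maxlen=window)  # rolling record of pass/fail

    def generate_problem(self):
        # Procedural generation: operand count grows with the difficulty level.
        nums = [random.randint(1, 99) for _ in range(self.level + 1)]
        return " + ".join(map(str, nums)) + " = ?", sum(nums)

    def score(self, answer, solution):
        # Algorithmically verifiable reward: exact-match check.
        reward = float(answer == solution)
        self.recent.append(reward)
        self._adapt()
        return reward

    def _adapt(self):
        # Keep the pass rate inside [target_low, target_high] so the learning
        # signal neither vanishes (too hard) nor saturates (too easy).
        if len(self.recent) < self.recent.maxlen:
            return
        pass_rate = sum(self.recent) / len(self.recent)
        if pass_rate > self.target_high:
            self.level = min(self.level + 1, self.max_level)
        elif pass_rate < self.target_low:
            self.level = max(self.level - 1, self.min_level)
        self.recent.clear()
```

A trainer would call `generate_problem` when sampling rollouts and `score` when computing rewards; difficulty adaptation then happens as a side effect of scoring, which is one simple way to keep problems near the edge of the policy's current ability.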
Related papers
- RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System [52.3348044324205]
We propose RLAnything, a reinforcement learning framework that forges environment, policy, and reward models through closed-loop optimization. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals. Our theory-motivated automatic environment adaptation improves training for both the reward and policy models.
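The "integrated feedback" presumably blends dense step-wise scores with a sparse trajectory outcome; one common way to do that is a simple convex mix, sketched below. The function name and weighting are assumptions, not RLAnything's actual formulation.

```python
def mixed_step_rewards(step_rewards, outcome_reward, alpha=0.5):
    """Blend per-step reward scores with a trajectory-level outcome signal
    broadcast to every step. `alpha` trades off the two sources; this
    scheme is illustrative, not the paper's exact method."""
    return [alpha * r + (1.0 - alpha) * outcome_reward for r in step_rewards]
```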
arXiv Detail & Related papers (2026-02-02T18:59:04Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems (those yielding no RL signals and mixed-quality reasoning traces) can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning [23.795932850992816]
We present R1-Code-Interpreter, an extension of text-only Large Language Models (LLMs) trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL). We show that training a general-purpose Code Interpreter across 144 diverse reasoning and planning tasks presents significant challenges due to task heterogeneity and scarcity of effective samples. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%, outperforming text-only GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%).
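As a rough sketch of the multi-turn reason-then-execute loop such an agent runs (the `generate` and `execute_python` helpers, the fence-matching heuristic, and the stop condition are all assumptions; the paper's actual scaffolding may differ):

```python
import re


def code_interpreter_rollout(generate, execute_python, prompt, max_turns=8):
    """Alternate between model generation and sandboxed code execution until
    the model stops emitting code blocks or the turn budget runs out.
    Hypothetical sketch; `generate` and `execute_python` are stand-ins."""
    transcript = prompt
    for _ in range(max_turns):
        completion = generate(transcript)
        transcript += completion
        block = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if block is None:  # no code block emitted: treat as the final answer
            break
        output = execute_python(block.group(1))  # sandboxed execution
        transcript += f"\n[interpreter output]\n{output}\n"
    return transcript
```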
arXiv Detail & Related papers (2025-05-27T18:47:33Z)
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
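"Trajectory-level" means credit is assigned to whole rollouts rather than individual turns. A common instantiation is group-normalized trajectory returns, sketched below as an illustration of the idea rather than StarPO's exact estimator:

```python
def trajectory_advantages(group_returns, eps=1e-8):
    """One scalar advantage per full rollout: its return standardized
    against the other rollouts sampled for the same prompt. Illustrative
    of trajectory-level credit assignment, not StarPO's exact estimator."""
    n = len(group_returns)
    mean = sum(group_returns) / n
    std = (sum((r - mean) ** 2 for r in group_returns) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_returns]
```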
arXiv Detail & Related papers (2025-04-24T17:57:08Z)
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [46.5893728376551]
This paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified, a human-verified collection of real-world GitHub issues. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even developed generalized reasoning skills.
arXiv Detail & Related papers (2025-02-25T18:45:04Z)
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents.
We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator.
We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z)
- RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark [69.19502244910632]
Combinatorial optimization (CO) is fundamental to several real-world applications, from logistics and scheduling to hardware design and resource allocation. Deep reinforcement learning has recently shown significant benefits in solving CO problems, reducing reliance on domain expertise and improving computational efficiency. We introduce RL4CO, a unified benchmark with in-depth library coverage of 27 CO problem environments and 23 state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-29T16:57:22Z)