RL in the Wild: Characterizing RLVR Training in LLM Deployment
- URL: http://arxiv.org/abs/2509.25279v2
- Date: Mon, 13 Oct 2025 05:01:17 GMT
- Title: RL in the Wild: Characterizing RLVR Training in LLM Deployment
- Authors: Jiecheng Zhou, Qinghao Hu, Yuyang Jin, Zerui Wang, Peng Sun, Yuzhe Gu, Wenwei Zhang, Mingshu Zhai, Xingcheng Zhang, Weiming Zhang, et al.
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months as a way to enhance LLMs' reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and RLVR remains poorly understood from a system perspective.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) are now widely used across many domains. With their rapid development, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months as a way to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and there is limited understanding of RLVR from a system perspective. To thoroughly understand the system challenges introduced by RLVR, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate the distribution and variation trends of workloads across different RL tasks and across training steps. We identify issues such as GPU idling caused by skewed sequence length distributions, inefficient parallel strategies under dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose the PolyTrace benchmark suite for evaluation with realistic workloads; a practical use case validates that PolyTrace achieves 94.7% accuracy.
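The GPU idling attributed to skewed sequence length distributions can be illustrated with a toy model (a minimal sketch, not code from the paper; the sharding scheme, worker count, and Pareto-distributed lengths are all assumptions): under synchronous rollout generation, a training step cannot proceed until the slowest worker finishes, so a long tail of generation lengths leaves the other GPUs waiting.

```python
import random

def idle_fraction(lengths, num_workers=8):
    """Estimate the GPU idle fraction under synchronous rollout
    generation: the step finishes only when the slowest worker does.

    Illustrative model only. Assumes decoding time is proportional
    to output length and that the batch is sharded round-robin.
    """
    # Evenly shard the batch of sequences across workers (round-robin).
    shards = [lengths[i::num_workers] for i in range(num_workers)]
    per_worker = [sum(shard) for shard in shards]
    makespan = max(per_worker)      # wall-clock time of the step
    busy = sum(per_worker)          # total useful work across workers
    return 1.0 - busy / (makespan * num_workers)

# A long-tailed (skewed) length distribution, qualitatively similar to
# the RLVR rollout lengths the study describes.
random.seed(0)
lengths = [int(random.paretovariate(1.5) * 200) for _ in range(256)]
print(f"idle fraction: {idle_fraction(lengths):.1%}")
```

With a uniform length distribution the idle fraction drops to zero, which is why the skew itself, rather than batch size, drives the idling the study reports.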
Related papers
- Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. We investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation.
arXiv Detail & Related papers (2026-01-08T03:06:18Z)
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled. We propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions".
arXiv Detail & Related papers (2025-10-29T22:05:08Z)
- Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle
Reinforcement Learning (RL) has markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs). This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs.
arXiv Detail & Related papers (2025-09-20T13:11:28Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- A Survey of Continual Reinforcement Learning
Reinforcement Learning (RL) is an important machine learning paradigm for solving sequential decision-making problems. RL's limited ability to generalize across tasks restricts its applicability in dynamic and real-world environments. Continual Reinforcement Learning (CRL) has emerged as a promising research direction to address these limitations.
arXiv Detail & Related papers (2025-06-27T03:10:20Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
Reasoning large language models (LLMs) excel at complex tasks. Existing approaches allocate an equal number of rollouts to all questions during reinforcement learning (RL). We propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems.
arXiv Detail & Related papers (2025-05-24T07:28:29Z)
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning
Visual Reinforcement Learning methods often require extensive amounts of data. Model-based RL (MBRL) offers a potential solution through efficient data utilization via planning. However, MBRL lacks generalization capabilities for real-world tasks.
arXiv Detail & Related papers (2024-11-15T13:21:26Z)
- A Tutorial on Meta-Reinforcement Learning
We cast the development of better RL algorithms as a machine learning problem itself, in a process called meta-RL. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.
arXiv Detail & Related papers (2023-01-19T12:01:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.