Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
- URL: http://arxiv.org/abs/2601.10079v1
- Date: Thu, 15 Jan 2026 05:12:03 GMT
- Title: Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
- Authors: Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang
- Abstract summary: Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). Existing KV compression techniques offer a remedy for inference, but directly applying them to RL training induces a severe policy mismatch. We introduce Sparse-RL, which empowers stable RL training under sparse rollouts.
- Score: 27.45707647061042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, a framework that enables stable RL training under sparse rollouts. We show that the instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
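Based only on the abstract above, a minimal sketch of how sparsity-aware rejection sampling and importance-based reweighting could fit together: the dense old policy's token log-probabilities anchor an importance correction for sequences generated by the sparse (KV-compressed) sampler, and sequences whose dense/sparse gap is too large are rejected. Function names, thresholds, and clipping values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sparsity_aware_filter_and_weights(
    logp_dense_old,    # log-probs of sampled tokens under the dense old policy
    logp_sparse_old,   # log-probs of the same tokens under the sparse (compressed) sampler
    reject_threshold=2.0,
):
    """Hypothetical sketch: reject rollouts whose dense/sparse log-prob gap is
    too large, and return per-token importance weights for the survivors."""
    logp_dense_old = np.asarray(logp_dense_old, dtype=np.float64)
    logp_sparse_old = np.asarray(logp_sparse_old, dtype=np.float64)

    # Per-token log-ratio between the dense reference and the sparse sampler.
    log_ratio = logp_dense_old - logp_sparse_old

    # Sparsity-aware rejection: drop sequences where compression-induced
    # information loss pushes the sampler too far from the dense policy.
    keep = np.abs(log_ratio).mean() <= reject_threshold

    # Importance-based reweighting: correct the off-policy bias of a kept
    # sequence by weighting each token with the (clipped) probability ratio.
    weights = np.clip(np.exp(log_ratio), 0.2, 5.0)
    return keep, weights


def reweighted_surrogate(logp_learner, logp_sparse_old, advantages, weights):
    """PPO-style surrogate where the sampling distribution is the sparse policy;
    `weights` re-anchors it to the dense old policy (all values hypothetical)."""
    ratio = np.exp(np.asarray(logp_learner) - np.asarray(logp_sparse_old))
    return -(weights * ratio * np.asarray(advantages)).mean()
```

The clipping of the importance weights is one common way to keep the variance of such off-policy corrections bounded; the actual paper may use a different correction scheme.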
Related papers
- Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement [21.073482007189504]
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks. Reinforcement learning under verifiable rewards (RLVR) is emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training.
arXiv Detail & Related papers (2026-01-31T16:51:50Z)
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z)
- Expressive Value Learning for Scalable Offline Reinforcement Learning [9.946269411850064]
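A rough, assumption-laden illustration of the non-parametric memory idea described above (class and method names are invented for this sketch, not taken from the JitRL paper): past experiences are stored with their returns, and an action's advantage is estimated at test time from its nearest stored neighbours, without any gradient update.

```python
import numpy as np

class EpisodicMemory:
    """Illustrative non-parametric memory: stores (state embedding, action,
    return) tuples and estimates an action's advantage from the k most
    similar past experiences."""

    def __init__(self, k=16):
        self.k = k
        self.embeddings, self.actions, self.returns = [], [], []

    def add(self, embedding, action, episode_return):
        self.embeddings.append(np.asarray(embedding, dtype=np.float64))
        self.actions.append(action)
        self.returns.append(float(episode_return))

    def estimate_advantage(self, embedding, action):
        if not self.embeddings:
            return 0.0
        E = np.stack(self.embeddings)
        q = np.asarray(embedding, dtype=np.float64)
        # Cosine similarity between the query state and every stored state.
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(-sims)[: self.k]
        rets = np.array([self.returns[i] for i in top])
        acts = [self.actions[i] for i in top]
        baseline = rets.mean()                      # value estimate from neighbours
        matched = [r for r, a in zip(rets, acts) if a == action]
        q_value = np.mean(matched) if matched else baseline
        return q_value - baseline                   # on-the-fly advantage estimate
```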
Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. Offline RL offers a promising avenue by training agents on large, diverse datasets. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR), a scalable offline RL approach that integrates both expressive policies and expressive value functions.
arXiv Detail & Related papers (2025-10-09T13:42:20Z)
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
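A hypothetical sketch of advantage-aware batch restructuring in the spirit of the summary above (the actual Shuffle-R1 procedure is not specified here; names and heuristics are assumptions): rollout groups whose rewards are all identical contribute no advantage signal and are dropped, while the remaining groups are sampled in proportion to their reward spread.

```python
import numpy as np

def restructure_batch(groups, batch_size, rng=np.random.default_rng(0)):
    """Hypothetical illustration of advantage-aware batch composition: groups
    with identical rewards carry zero advantage everywhere and are dropped;
    the rest are drawn with probability proportional to their reward spread."""
    informative = [g for g in groups if np.std(g["rewards"]) > 1e-6]
    if not informative:
        return []
    spread = np.array([np.std(g["rewards"]) for g in informative])
    probs = spread / spread.sum()
    idx = rng.choice(len(informative),
                     size=min(batch_size, len(informative)),
                     replace=False, p=probs)
    return [informative[i] for i in idx]
```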
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- CAMP in the Odyssey: Provably Robust Reinforcement Learning with Certified Radius Maximization [27.55377940017779]
Deep reinforcement learning (DRL) has gained widespread adoption in control and decision-making tasks due to its strong performance in dynamic environments. Recent efforts have focused on addressing robustness issues by establishing rigorous theoretical guarantees for the returns achieved by DRL agents in adversarial settings. We introduce a novel paradigm dubbed Certified-rAdius-Maximizing Policy (CAMP) training to enhance DRL policies.
arXiv Detail & Related papers (2025-01-29T14:08:08Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
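Since the method is described as a small resampling change, here is a minimal sketch of return-proportional resampling, assuming episode-level returns are available (the exact ReD weighting scheme is not given in the summary above, so the shift-and-normalize step is an assumption):

```python
import numpy as np

def return_based_rebalance(episode_returns, num_samples,
                           rng=np.random.default_rng(0)):
    """Illustrative resampling in the spirit of return-based rebalancing:
    draw training episodes with probability proportional to their (shifted)
    return, so the support of the dataset stays unchanged."""
    returns = np.asarray(episode_returns, dtype=np.float64)
    weights = returns - returns.min() + 1e-6   # shift so all weights are positive
    probs = weights / weights.sum()
    # Indices of episodes to train on, sampled with replacement.
    return rng.choice(len(returns), size=num_samples, replace=True, p=probs)
```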
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Efficient Adversarial Training without Attacking: Worst-Case-Aware Robust Reinforcement Learning [14.702446153750497]
Worst-case-aware Robust RL (WocaR-RL) is a robust training framework for deep reinforcement learning.
We show that WocaR-RL achieves state-of-the-art performance under various strong attacks.
arXiv Detail & Related papers (2022-10-12T05:24:46Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [54.34189781923818]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- RORL: Robust Offline Reinforcement Learning via Conservative Smoothing [72.8062448549897]
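A simplified, tabular illustration of importance-sampled Q-learning consistent with the summary above (not the ImRE algorithm verbatim; the signature and constants are assumptions): rare failure transitions are drawn more often than they actually occur, so the temporal-difference update is reweighted by the ratio of true to sampling probability to keep the value estimate unbiased.

```python
import numpy as np

def importance_weighted_q_update(Q, s, a, r, s_next, p_true, p_sampled,
                                 alpha=0.1, gamma=0.99):
    """Sketch of one importance-sampled Q-learning step on a tabular Q array.

    p_true    -- probability of this transition under the real environment
    p_sampled -- probability under the (rare-event-boosted) sampling scheme
    """
    w = p_true / p_sampled                      # importance weight
    td_target = r + gamma * np.max(Q[s_next])   # standard Q-learning target
    Q[s, a] += alpha * w * (td_target - Q[s, a])
    return Q
```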
Offline reinforcement learning can exploit the massive amount of offline data for complex decision-making tasks.
Current offline RL algorithms are generally designed to be conservative for value estimation and action selection.
We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.
arXiv Detail & Related papers (2022-06-06T18:07:41Z)