Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
- URL: http://arxiv.org/abs/2601.10079v1
- Date: Thu, 15 Jan 2026 05:12:03 GMT
- Title: Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
- Authors: Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang
- Abstract summary: Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). Existing KV compression techniques offer a remedy for inference, but directly applying them to RL training induces a severe policy mismatch. We introduce Sparse-RL, which empowers stable RL training under sparse rollouts.
- Score: 27.45707647061042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, a framework that enables stable RL training under sparse rollouts. We show that the instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
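Based only on the abstract above, a minimal sketch of how sparsity-aware rejection sampling and importance-based reweighting could fit together: the dense old policy's token log-probabilities anchor an importance correction for sequences generated by the sparse (KV-compressed) sampler, and sequences whose dense/sparse gap is too large are rejected. Function names, thresholds, and clipping values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sparsity_aware_filter_and_weights(
    logp_dense_old,    # log-probs of sampled tokens under the dense old policy
    logp_sparse_old,   # log-probs of the same tokens under the sparse (compressed) sampler
    reject_threshold=2.0,
):
    """Hypothetical sketch: reject rollouts whose dense/sparse log-prob gap is
    too large, and return per-token importance weights for the survivors."""
    logp_dense_old = np.asarray(logp_dense_old, dtype=np.float64)
    logp_sparse_old = np.asarray(logp_sparse_old, dtype=np.float64)

    # Per-token log-ratio between the dense reference and the sparse sampler.
    log_ratio = logp_dense_old - logp_sparse_old

    # Sparsity-aware rejection: drop sequences where compression-induced
    # information loss pushes the sampler too far from the dense policy.
    keep = np.abs(log_ratio).mean() <= reject_threshold

    # Importance-based reweighting: correct the off-policy bias of a kept
    # sequence by weighting each token with the (clipped) probability ratio.
    weights = np.clip(np.exp(log_ratio), 0.2, 5.0)
    return keep, weights


def reweighted_surrogate(logp_learner, logp_sparse_old, advantages, weights):
    """PPO-style surrogate where the sampling distribution is the sparse policy;
    `weights` re-anchors it to the dense old policy (all values hypothetical)."""
    ratio = np.exp(np.asarray(logp_learner) - np.asarray(logp_sparse_old))
    return -(weights * ratio * np.asarray(advantages)).mean()
```

The clipping of the importance weights is one common way to keep the variance of such off-policy corrections bounded; the actual paper may use a different correction scheme.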
Related papers
- Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement [21.073482007189504]
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks. Reinforcement learning under verifiable rewards (RLVR) is emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training.
arXiv Detail & Related papers (2026-01-31T16:51:50Z)
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z)
- Expressive Value Learning for Scalable Offline Reinforcement Learning [9.946269411850064]
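A rough, assumption-laden illustration of the non-parametric memory idea described above (class and method names are invented for this sketch, not taken from the JitRL paper): past experiences are stored with their returns, and an action's advantage is estimated at test time from its nearest stored neighbours, without any gradient update.

```python
import numpy as np

class EpisodicMemory:
    """Illustrative non-parametric memory: stores (state embedding, action,
    return) tuples and estimates an action's advantage from the k most
    similar past experiences."""

    def __init__(self, k=16):
        self.k = k
        self.embeddings, self.actions, self.returns = [], [], []

    def add(self, embedding, action, episode_return):
        self.embeddings.append(np.asarray(embedding, dtype=np.float64))
        self.actions.append(action)
        self.returns.append(float(episode_return))

    def estimate_advantage(self, embedding, action):
        if not self.embeddings:
            return 0.0
        E = np.stack(self.embeddings)
        q = np.asarray(embedding, dtype=np.float64)
        # Cosine similarity between the query state and every stored state.
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(-sims)[: self.k]
        rets = np.array([self.returns[i] for i in top])
        acts = [self.actions[i] for i in top]
        baseline = rets.mean()                      # value estimate from neighbours
        matched = [r for r, a in zip(rets, acts) if a == action]
        q_value = np.mean(matched) if matched else baseline
        return q_value - baseline                   # on-the-fly advantage estimate
```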
Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. Offline RL offers a promising avenue by training agents on large, diverse datasets. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR), a scalable offline RL approach that integrates both expressive policies and expressive value functions.
arXiv Detail & Related papers (2025-10-09T13:42:20Z)
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [65.14124923451077]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
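A hypothetical sketch of advantage-aware batch restructuring in the spirit of the summary above (the actual Shuffle-R1 procedure is not specified here; names and heuristics are assumptions): rollout groups whose rewards are all identical contribute no advantage signal and are dropped, while the remaining groups are sampled in proportion to their reward spread.

```python
import numpy as np

def restructure_batch(groups, batch_size, rng=np.random.default_rng(0)):
    """Hypothetical illustration of advantage-aware batch composition: groups
    with identical rewards carry zero advantage everywhere and are dropped;
    the rest are drawn with probability proportional to their reward spread."""
    informative = [g for g in groups if np.std(g["rewards"]) > 1e-6]
    if not informative:
        return []
    spread = np.array([np.std(g["rewards"]) for g in informative])
    probs = spread / spread.sum()
    idx = rng.choice(len(informative),
                     size=min(batch_size, len(informative)),
                     replace=False, p=probs)
    return [informative[i] for i in idx]
```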
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- CAMP in the Odyssey: Provably Robust Reinforcement Learning with Certified Radius Maximization [27.55377940017779]
Deep reinforcement learning (DRL) has gained widespread adoption in control and decision-making tasks due to its strong performance in dynamic environments. Recent efforts have focused on addressing robustness issues by establishing rigorous theoretical guarantees for the returns achieved by DRL agents in adversarial settings. We introduce a novel paradigm dubbed Certified-rAdius-Maximizing Policy (CAMP) training to enhance DRL policies.
arXiv Detail & Related papers (2025-01-29T14:08:08Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
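Since the method is described as a small resampling change, here is a minimal sketch of return-proportional resampling, assuming episode-level returns are available (the exact ReD weighting scheme is not given in the summary above, so the shift-and-normalize step is an assumption):

```python
import numpy as np

def return_based_rebalance(episode_returns, num_samples,
                           rng=np.random.default_rng(0)):
    """Illustrative resampling in the spirit of return-based rebalancing:
    draw training episodes with probability proportional to their (shifted)
    return, so the support of the dataset stays unchanged."""
    returns = np.asarray(episode_returns, dtype=np.float64)
    weights = returns - returns.min() + 1e-6   # shift so all weights are positive
    probs = weights / weights.sum()
    # Indices of episodes to train on, sampled with replacement.
    return rng.choice(len(returns), size=num_samples, replace=True, p=probs)
```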
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Efficient Adversarial Training without Attacking: Worst-Case-Aware Robust Reinforcement Learning [14.702446153750497]
Worst-case-aware Robust RL (WocaR-RL) is a robust training framework for deep reinforcement learning.
We show that WocaR-RL achieves state-of-the-art performance under various strong attacks.
arXiv Detail & Related papers (2022-10-12T05:24:46Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [54.34189781923818]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- RORL: Robust Offline Reinforcement Learning via Conservative Smoothing [72.8062448549897]
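A simplified, tabular illustration of importance-sampled Q-learning consistent with the summary above (not the ImRE algorithm verbatim; the signature and constants are assumptions): rare failure transitions are drawn more often than they actually occur, so the temporal-difference update is reweighted by the ratio of true to sampling probability to keep the value estimate unbiased.

```python
import numpy as np

def importance_weighted_q_update(Q, s, a, r, s_next, p_true, p_sampled,
                                 alpha=0.1, gamma=0.99):
    """Sketch of one importance-sampled Q-learning step on a tabular Q array.

    p_true    -- probability of this transition under the real environment
    p_sampled -- probability under the (rare-event-boosted) sampling scheme
    """
    w = p_true / p_sampled                      # importance weight
    td_target = r + gamma * np.max(Q[s_next])   # standard Q-learning target
    Q[s, a] += alpha * w * (td_target - Q[s, a])
    return Q
```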
Offline reinforcement learning can exploit the massive amount of offline data for complex decision-making tasks.
Current offline RL algorithms are generally designed to be conservative for value estimation and action selection.
We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.
arXiv Detail & Related papers (2022-06-06T18:07:41Z)