RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning
- URL: http://arxiv.org/abs/2510.02892v1
- Date: Fri, 03 Oct 2025 10:59:26 GMT
- Title: RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning
- Authors: Aleksei Arzhantsev, Otmane Sakhi, Flavian Vasile
- Abstract summary: Reinforcement learning is central to improving reasoning in large language models (LLMs). We propose RoiRL: Reasoning with offline iterative Reinforcement Learning. We show that RoiRL trains up to 2.5x faster than TTRL and consistently outperforms it on reasoning benchmarks.
- Score: 4.311472216447055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster than TTRL and consistently outperforms it on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
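The abstract names the ingredients (majority-vote pseudo-rewards, no reference model, weighted log-likelihood) without giving code. Below is a minimal sketch of one RoiRL-style offline iteration built from just those ingredients, assuming a Hugging Face causal LM; the model name, the `extract_answer` parser, the binary agreement weighting, and all hyperparameters are illustrative assumptions, not the paper's exact objective, which targets regularized optimal policies.

```python
# Minimal sketch of one RoiRL-style offline iteration (illustrative only):
# sample a group of answers, weight each by agreement with the majority
# vote, then maximize a weighted log-likelihood. No reference model needed.
import re
from collections import Counter

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small causal LM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def extract_answer(text: str) -> str:
    """Hypothetical parser: take the last number in the completion."""
    nums = re.findall(r"-?\d+\.?\d*", text)
    return nums[-1] if nums else ""

def roirl_step(question: str, group_size: int = 8, max_new: int = 128):
    # Offline phase: sample a group of answers from the current model.
    enc = tok(question, return_tensors="pt")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        gen = model.generate(**enc, do_sample=True, temperature=0.8,
                             num_return_sequences=group_size,
                             max_new_tokens=max_new,
                             pad_token_id=tok.eos_token_id)
    texts = [tok.decode(g[prompt_len:], skip_special_tokens=True) for g in gen]
    answers = [extract_answer(t) for t in texts]
    votes = Counter(a for a in answers if a)
    if not votes:                       # no parseable answer in the group
        return None
    # Self-supervised pseudo-reward: agreement with the majority vote.
    majority = votes.most_common(1)[0][0]
    weights = torch.tensor([1.0 if a == majority else 0.0 for a in answers])
    # Weighted log-likelihood update on the sampled completions
    # (padding tokens are counted here for brevity; mask them in real code).
    logits = model(gen).logits[:, :-1]
    logp = F.log_softmax(logits, dim=-1).gather(-1, gen[:, 1:, None]).squeeze(-1)
    comp_logp = logp[:, prompt_len - 1:].sum(-1)  # log-prob of each completion
    loss = -(weights * comp_logp).sum() / weights.sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return majority
```

Because the update is plain weighted maximum likelihood over already-sampled completions, no frozen reference model has to be held in memory, which is where the abstract's memory and compute savings over TTRL come from.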
Related papers
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z)
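The blurb names the mechanism (a non-parametric memory queried for action advantages) without detail. The sketch below is one way such a memory could work; the embedding normalization, cosine retrieval, the choice of k, and the neighbour-baseline advantage are all assumptions for illustration, not JitRL's published design.

```python
import numpy as np

class EpisodicAdvantageMemory:
    """Toy non-parametric memory in the spirit of the JitRL summary above:
    store (state-embedding, action, return) tuples and score a candidate
    action by how much better similar past experiences with that action
    did than the local average. All design choices here are assumptions."""

    def __init__(self, k: int = 16):
        self.k = k
        self.keys, self.actions, self.returns = [], [], []

    def add(self, state_emb: np.ndarray, action: str, ret: float) -> None:
        self.keys.append(state_emb / (np.linalg.norm(state_emb) + 1e-8))
        self.actions.append(action)
        self.returns.append(ret)

    def advantage(self, state_emb: np.ndarray, action: str) -> float:
        if not self.keys:
            return 0.0
        q = state_emb / (np.linalg.norm(state_emb) + 1e-8)
        sims = np.stack(self.keys) @ q                  # cosine similarity
        idx = np.argsort(sims)[-self.k:]                # k nearest neighbours
        baseline = np.mean([self.returns[i] for i in idx])  # local value estimate
        match = [i for i in idx if self.actions[i] == action]
        if not match:
            return 0.0                                  # no evidence either way
        return float(np.mean([self.returns[i] for i in match]) - baseline)
```

At test time an agent would score each candidate action with `advantage(...)` and act greedily, so behavior improves while the model weights never change, consistent with the "no gradient updates" claim.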
- Transitive RL: Value Learning via Divide and Conquer [54.190627631246166]
Transitive Reinforcement Learning (TRL) is a new value learning algorithm based on a divide-and-conquer paradigm. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming.
arXiv Detail & Related papers (2025-10-26T03:32:31Z)
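No equations are given in the blurb, so the toy program below only illustrates the transitive, divide-and-conquer flavor of value learning it describes: on a deterministic graph, long-horizon values are composed from two shorter-horizon values through a waypoint (a Floyd-Warshall-style dynamic program) rather than estimated from long, high-variance Monte Carlo rollouts. This is an interpretation of the summary, not the paper's actual objective.

```python
import numpy as np

def transitive_values(adj: np.ndarray, gamma: float = 0.9) -> np.ndarray:
    """Toy divide-and-conquer value computation on a deterministic graph.
    V[s, g] estimates gamma ** (shortest path length from s to g); each
    long-horizon value is composed from two shorter ones via a waypoint w:
        V[s, g] = max(V[s, g], V[s, w] * V[w, g]).
    Illustrates the 'transitive' idea only, not the paper's method."""
    n = adj.shape[0]
    V = np.where(adj > 0, gamma, 0.0)      # one-step values
    np.fill_diagonal(V, 1.0)               # reaching yourself is free
    for w in range(n):                     # dynamic programming over waypoints
        V = np.maximum(V, np.outer(V[:, w], V[w, :]))
    return V

# 4-state chain 0 -> 1 -> 2 -> 3: V[0, 3] becomes gamma**3 with no rollouts.
chain = np.eye(4, k=1)
print(transitive_values(chain)[0, 3])      # 0.729 == 0.9 ** 3
```

Here V[0, 3] is recovered exactly by composing one-step values, which is the sense in which dynamic programming sidesteps Monte Carlo variance.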
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning [42.54530036364341]
In environments with sparse rewards, reinforcement learning struggles to sample successful trajectories. We introduce SuperRL, a unified training framework that alternates between RL and SFT. Experiments show that SuperRL surpasses vanilla RL by delivering higher sample efficiency, stronger generalization, and improved robustness under sparse rewards.
arXiv Detail & Related papers (2025-06-01T17:43:54Z)
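As one reading of "alternates between RL and SFT", here is a schematic scheduler; the block sizes and the fixed alternation rule are assumptions (the paper may switch adaptively), and the two update callables stand in for real RL and SFT steps.

```python
from typing import Callable

def superrl_schedule(rl_step: Callable[[], float],
                     sft_step: Callable[[], float],
                     total_steps: int,
                     rl_block: int = 64,
                     sft_block: int = 16) -> list[float]:
    """Schematic of the SuperRL summary above: alternate blocks of RL
    updates with blocks of SFT updates so supervision keeps providing
    learning signal when rewards are too sparse for RL alone. The block
    sizes and fixed alternation are illustrative assumptions."""
    losses, step = [], 0
    while step < total_steps:
        for _ in range(min(rl_block, total_steps - step)):   # explore with RL
            losses.append(rl_step()); step += 1
        for _ in range(min(sft_block, total_steps - step)):  # anchor with SFT
            losses.append(sft_step()); step += 1
    return losses

# Toy usage; plug in real RL and SFT update functions in practice.
import random
print(len(superrl_schedule(lambda: random.random(), lambda: 0.5, 200)))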
- SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Model (LLM) training with limited initial data. The proposed SeRL yields results superior to its counterparts and achieves performance on par with that obtained from high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- TTRL: Test-Time Reinforcement Learning [31.351608137721875]
Test-Time Reinforcement Learning (TTRL) is a novel method for training Large Language Models (LLMs) on unlabeled data. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models.
arXiv Detail & Related papers (2025-04-22T17:59:56Z)
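For contrast with the RoiRL sketch earlier: TTRL keeps the majority-vote reward but spends it inside online RL. The snippet below shows the shape of such an update with a GRPO-style group-normalized advantage and a KL penalty against a frozen reference model; these specifics are assumptions for illustration, not TTRL's exact loss.

```python
import torch

def online_group_update(policy_logp: torch.Tensor,
                        ref_logp: torch.Tensor,
                        rewards: torch.Tensor,
                        kl_coef: float = 0.05) -> torch.Tensor:
    """Schematic loss for online RL with majority-vote rewards.
    policy_logp / ref_logp: per-completion log-probs under the trainable
    policy and a frozen reference model; rewards: 0/1 agreement with the
    majority vote. The group-normalized advantage and KL proxy are
    illustrative of GRPO-style training, not TTRL's exact objective."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    pg = -(adv * policy_logp).mean()          # policy-gradient term
    kl = (policy_logp - ref_logp).mean()      # crude KL-to-reference proxy
    return pg + kl_coef * kl

rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])  # toy majority-vote outcomes
policy_logp = torch.randn(4, requires_grad=True)
online_group_update(policy_logp, policy_logp.detach() - 0.1, rewards).backward()
```

The point of the contrast is cost: this loss needs a second (reference) model in memory and fresh on-policy samples each step, which the RoiRL abstract identifies as the overhead its weighted log-likelihood objective removes.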
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z)
- Unsupervised-to-Online Reinforcement Learning [59.910638327123394]
Unsupervised-to-online RL (U2O RL) replaces domain-specific supervised offline RL with unsupervised offline RL.
U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations.
We empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches.
arXiv Detail & Related papers (2024-08-27T05:23:45Z)
- Knowledge Graph Reasoning with Self-supervised Reinforcement Learning [30.359557545737747]
We propose a self-supervised pre-training method to warm up the policy network before the RL training stage. In our supervised learning stage, the agent selects actions based on the policy network and learns from generated labels. We show that our SSRL model meets or exceeds current state-of-the-art results on all Hits@k and mean reciprocal rank (MRR) metrics.
arXiv Detail & Related papers (2024-05-22T13:39:33Z)
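Condensing the two-stage recipe in the SSRL summary above into a toy contextual bandit: a supervised warm-up on self-generated labels, then standard RL from the warmed-up policy. The environment, the reward-weighted labeling rule, and all modeling choices are illustrative assumptions, not the paper's knowledge-graph setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
policy = torch.nn.Linear(4, 3)                   # toy policy over 3 actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reward(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    return (action == state.argmax(-1)).float()  # toy task: pick the max feature

# Stage 1: self-supervised warm-up. The agent acts with its own policy and
# treats its sampled actions as labels, weighted by the reward they earned
# (a self-generated-label stand-in for the paper's pre-training stage).
for _ in range(200):
    s = torch.randn(32, 4)
    with torch.no_grad():
        a = torch.distributions.Categorical(logits=policy(s)).sample()
    w = reward(s, a)
    loss = (w * F.cross_entropy(policy(s), a, reduction="none")).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: standard RL (REINFORCE) starting from the warmed-up policy.
for _ in range(200):
    s = torch.randn(32, 4)
    dist = torch.distributions.Categorical(logits=policy(s))
    a = dist.sample()
    loss = -((reward(s, a) - 0.5) * dist.log_prob(a)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```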