RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning
- URL: http://arxiv.org/abs/2510.02892v1
- Date: Fri, 03 Oct 2025 10:59:26 GMT
- Title: RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning
- Authors: Aleksei Arzhantsev, Otmane Sakhi, Flavian Vasile
- Abstract summary: Reinforcement learning is central to improving reasoning in large language models (LLMs). We propose RoiRL: Reasoning with offline iterative Reinforcement Learning. We show that RoiRL trains up to 2.5x faster than TTRL and consistently outperforms it on reasoning benchmarks.
- Score: 4.311472216447055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster than TTRL and consistently outperforms it on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
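The abstract names the ingredients (majority-vote pseudo-rewards, no reference model, weighted log-likelihood) without giving code. Below is a minimal sketch of one RoiRL-style offline iteration built from just those ingredients, assuming a Hugging Face causal LM; the model name, the `extract_answer` parser, the binary agreement weighting, and all hyperparameters are illustrative assumptions, not the paper's exact objective, which targets regularized optimal policies.

```python
# Minimal sketch of one RoiRL-style offline iteration (illustrative only):
# sample a group of answers, weight each by agreement with the majority
# vote, then maximize a weighted log-likelihood. No reference model needed.
import re
from collections import Counter

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small causal LM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def extract_answer(text: str) -> str:
    """Hypothetical parser: take the last number in the completion."""
    nums = re.findall(r"-?\d+\.?\d*", text)
    return nums[-1] if nums else ""

def roirl_step(question: str, group_size: int = 8, max_new: int = 128):
    # Offline phase: sample a group of answers from the current model.
    enc = tok(question, return_tensors="pt")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        gen = model.generate(**enc, do_sample=True, temperature=0.8,
                             num_return_sequences=group_size,
                             max_new_tokens=max_new,
                             pad_token_id=tok.eos_token_id)
    texts = [tok.decode(g[prompt_len:], skip_special_tokens=True) for g in gen]
    answers = [extract_answer(t) for t in texts]
    votes = Counter(a for a in answers if a)
    if not votes:                       # no parseable answer in the group
        return None
    # Self-supervised pseudo-reward: agreement with the majority vote.
    majority = votes.most_common(1)[0][0]
    weights = torch.tensor([1.0 if a == majority else 0.0 for a in answers])
    # Weighted log-likelihood update on the sampled completions
    # (padding tokens are counted here for brevity; mask them in real code).
    logits = model(gen).logits[:, :-1]
    logp = F.log_softmax(logits, dim=-1).gather(-1, gen[:, 1:, None]).squeeze(-1)
    comp_logp = logp[:, prompt_len - 1:].sum(-1)  # log-prob of each completion
    loss = -(weights * comp_logp).sum() / weights.sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return majority
```

Because the update is plain weighted maximum likelihood over already-sampled completions, no frozen reference model has to be held in memory, which is where the abstract's memory and compute savings over TTRL come from.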
Related papers
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z)
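The blurb names the mechanism (a non-parametric memory queried for action advantages) without detail. The sketch below is one way such a memory could work; the embedding normalization, cosine retrieval, the choice of k, and the neighbour-baseline advantage are all assumptions for illustration, not JitRL's published design.

```python
import numpy as np

class EpisodicAdvantageMemory:
    """Toy non-parametric memory in the spirit of the JitRL summary above:
    store (state-embedding, action, return) tuples and score a candidate
    action by how much better similar past experiences with that action
    did than the local average. All design choices here are assumptions."""

    def __init__(self, k: int = 16):
        self.k = k
        self.keys, self.actions, self.returns = [], [], []

    def add(self, state_emb: np.ndarray, action: str, ret: float) -> None:
        self.keys.append(state_emb / (np.linalg.norm(state_emb) + 1e-8))
        self.actions.append(action)
        self.returns.append(ret)

    def advantage(self, state_emb: np.ndarray, action: str) -> float:
        if not self.keys:
            return 0.0
        q = state_emb / (np.linalg.norm(state_emb) + 1e-8)
        sims = np.stack(self.keys) @ q                  # cosine similarity
        idx = np.argsort(sims)[-self.k:]                # k nearest neighbours
        baseline = np.mean([self.returns[i] for i in idx])  # local value estimate
        match = [i for i in idx if self.actions[i] == action]
        if not match:
            return 0.0                                  # no evidence either way
        return float(np.mean([self.returns[i] for i in match]) - baseline)
```

At test time an agent would score each candidate action with `advantage(...)` and act greedily, so behavior improves while the model weights never change, consistent with the "no gradient updates" claim.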
- Transitive RL: Value Learning via Divide and Conquer [54.190627631246166]
Transitive Reinforcement Learning (TRL) is a new value learning algorithm based on a divide-and-conquer paradigm. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming.
arXiv Detail & Related papers (2025-10-26T03:32:31Z)
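No equations are given in the blurb, so the toy program below only illustrates the transitive, divide-and-conquer flavor of value learning it describes: on a deterministic graph, long-horizon values are composed from two shorter-horizon values through a waypoint (a Floyd-Warshall-style dynamic program) rather than estimated from long, high-variance Monte Carlo rollouts. This is an interpretation of the summary, not the paper's actual objective.

```python
import numpy as np

def transitive_values(adj: np.ndarray, gamma: float = 0.9) -> np.ndarray:
    """Toy divide-and-conquer value computation on a deterministic graph.
    V[s, g] estimates gamma ** (shortest path length from s to g); each
    long-horizon value is composed from two shorter ones via a waypoint w:
        V[s, g] = max(V[s, g], V[s, w] * V[w, g]).
    Illustrates the 'transitive' idea only, not the paper's method."""
    n = adj.shape[0]
    V = np.where(adj > 0, gamma, 0.0)      # one-step values
    np.fill_diagonal(V, 1.0)               # reaching yourself is free
    for w in range(n):                     # dynamic programming over waypoints
        V = np.maximum(V, np.outer(V[:, w], V[w, :]))
    return V

# 4-state chain 0 -> 1 -> 2 -> 3: V[0, 3] becomes gamma**3 with no rollouts.
chain = np.eye(4, k=1)
print(transitive_values(chain)[0, 3])      # 0.729 == 0.9 ** 3
```

Here V[0, 3] is recovered exactly by composing one-step values, which is the sense in which dynamic programming sidesteps Monte Carlo variance.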
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning [42.54530036364341]
In environments with sparse rewards, reinforcement learning struggles to sample successful trajectories. We introduce SuperRL, a unified training framework that alternates between RL and SFT. Experiments show that SuperRL surpasses vanilla RL by delivering higher sample efficiency, stronger generalization, and improved robustness under sparse rewards.
arXiv Detail & Related papers (2025-06-01T17:43:54Z)
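As one reading of "alternates between RL and SFT", here is a schematic scheduler; the block sizes and the fixed alternation rule are assumptions (the paper may switch adaptively), and the two update callables stand in for real RL and SFT steps.

```python
from typing import Callable

def superrl_schedule(rl_step: Callable[[], float],
                     sft_step: Callable[[], float],
                     total_steps: int,
                     rl_block: int = 64,
                     sft_block: int = 16) -> list[float]:
    """Schematic of the SuperRL summary above: alternate blocks of RL
    updates with blocks of SFT updates so supervision keeps providing
    learning signal when rewards are too sparse for RL alone. The block
    sizes and fixed alternation are illustrative assumptions."""
    losses, step = [], 0
    while step < total_steps:
        for _ in range(min(rl_block, total_steps - step)):   # explore with RL
            losses.append(rl_step()); step += 1
        for _ in range(min(sft_block, total_steps - step)):  # anchor with SFT
            losses.append(sft_step()); step += 1
    return losses

# Toy usage; plug in real RL and SFT update functions in practice.
import random
print(len(superrl_schedule(lambda: random.random(), lambda: 0.5, 200)))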
- SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Model (LLM) training with limited initial data. The proposed SeRL yields results superior to its counterparts and achieves performance on par with that obtained from high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- TTRL: Test-Time Reinforcement Learning [31.351608137721875]
Test-Time Reinforcement Learning (TTRL) is a novel method for training Large Language Models (LLMs) on unlabeled data. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models.
arXiv Detail & Related papers (2025-04-22T17:59:56Z)
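For contrast with the RoiRL sketch earlier: TTRL keeps the majority-vote reward but spends it inside online RL. The snippet below shows the shape of such an update with a GRPO-style group-normalized advantage and a KL penalty against a frozen reference model; these specifics are assumptions for illustration, not TTRL's exact loss.

```python
import torch

def online_group_update(policy_logp: torch.Tensor,
                        ref_logp: torch.Tensor,
                        rewards: torch.Tensor,
                        kl_coef: float = 0.05) -> torch.Tensor:
    """Schematic loss for online RL with majority-vote rewards.
    policy_logp / ref_logp: per-completion log-probs under the trainable
    policy and a frozen reference model; rewards: 0/1 agreement with the
    majority vote. The group-normalized advantage and KL proxy are
    illustrative of GRPO-style training, not TTRL's exact objective."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    pg = -(adv * policy_logp).mean()          # policy-gradient term
    kl = (policy_logp - ref_logp).mean()      # crude KL-to-reference proxy
    return pg + kl_coef * kl

rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])  # toy majority-vote outcomes
policy_logp = torch.randn(4, requires_grad=True)
online_group_update(policy_logp, policy_logp.detach() - 0.1, rewards).backward()
```

The point of the contrast is cost: this loss needs a second (reference) model in memory and fresh on-policy samples each step, which the RoiRL abstract identifies as the overhead its weighted log-likelihood objective removes.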
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z)
- Unsupervised-to-Online Reinforcement Learning [59.910638327123394]
Unsupervised-to-online RL (U2O RL) replaces domain-specific supervised offline RL with unsupervised offline RL.
U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations.
We empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches.
arXiv Detail & Related papers (2024-08-27T05:23:45Z)
- Knowledge Graph Reasoning with Self-supervised Reinforcement Learning [30.359557545737747]
We propose a self-supervised pre-training method to warm up the policy network before the RL training stage. In our supervised learning stage, the agent selects actions based on the policy network and learns from generated labels. We show that our SSRL model meets or exceeds current state-of-the-art results on all Hits@k and mean reciprocal rank (MRR) metrics.
arXiv Detail & Related papers (2024-05-22T13:39:33Z)
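Condensing the two-stage recipe in the SSRL summary above into a toy contextual bandit: a supervised warm-up on self-generated labels, then standard RL from the warmed-up policy. The environment, the reward-weighted labeling rule, and all modeling choices are illustrative assumptions, not the paper's knowledge-graph setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
policy = torch.nn.Linear(4, 3)                   # toy policy over 3 actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reward(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    return (action == state.argmax(-1)).float()  # toy task: pick the max feature

# Stage 1: self-supervised warm-up. The agent acts with its own policy and
# treats its sampled actions as labels, weighted by the reward they earned
# (a self-generated-label stand-in for the paper's pre-training stage).
for _ in range(200):
    s = torch.randn(32, 4)
    with torch.no_grad():
        a = torch.distributions.Categorical(logits=policy(s)).sample()
    w = reward(s, a)
    loss = (w * F.cross_entropy(policy(s), a, reduction="none")).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: standard RL (REINFORCE) starting from the warmed-up policy.
for _ in range(200):
    s = torch.randn(32, 4)
    dist = torch.distributions.Categorical(logits=policy(s))
    a = dist.sample()
    loss = -((reward(s, a) - 0.5) * dist.log_prob(a)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```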