Scaling Offline RL via Efficient and Expressive Shortcut Models
- URL: http://arxiv.org/abs/2505.22866v1
- Date: Wed, 28 May 2025 20:59:22 GMT
- Title: Scaling Offline RL via Efficient and Expressive Shortcut Models
- Authors: Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, Wen Sun,
- Abstract summary: offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes.<n>We introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models to scale both training and inference.<n>We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute.
- Score: 13.050231036248338
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models - a novel class of generative models - to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at nico-espinosadice.github.io/projects/sorl.
Related papers
- Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch [63.40752011615843]
Training tool-augmented language models has emerged as a promising approach to enhancing their capabilities for complex tasks.<n>We propose a dynamic generalization-guided reward design for rule-based reinforcement learning.<n>We show that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models.
arXiv Detail & Related papers (2025-11-02T16:33:45Z) - Expressive Value Learning for Scalable Offline Reinforcement Learning [9.946269411850064]
Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions.<n> offline RL offers a promising avenue by training agents on large, diverse datasets.<n>We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR), a scalable offline RL approach that integrates both expressive policies and expressive value functions.
arXiv Detail & Related papers (2025-10-09T13:42:20Z) - PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation [47.510888611491]
Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs)<n>This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness.
arXiv Detail & Related papers (2025-09-23T15:15:21Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks.<n>We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions.<n>We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - RAST: Reasoning Activation in LLMs via Small-model Transfer [33.32587030836428]
Reinforcement learning (RL) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs)<n>Applying RL at scale remains intimidatingly resource-intensive, requiring multiple model copies and extensive GPU workloads.<n>We propose RAST, a simple yet effective method that transfers reasoning behaviors by injecting RL-induced probability adjustments from a small RL-trained model into larger models.
arXiv Detail & Related papers (2025-05-30T17:57:08Z) - Normalizing Flows are Capable Models for RL [24.876149287707847]
We propose a single Normalizing Flow architecture which integrates seamlessly into reinforcement learning algorithms.<n>Our approach leads to much simpler algorithms, and achieves higher performance in imitation learning, offline, goal conditioned RL and unsupervised RL.
arXiv Detail & Related papers (2025-05-29T15:06:22Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.<n>We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF.<n>Online DPO is found to be most robust to off-policy data.<n>Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
arXiv Detail & Related papers (2024-10-23T19:59:50Z) - Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning [6.345851712811528]
We introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ)<n>LEQ provides a low-bias model-based value estimation via lower expectile regression of $lambda$-returns.<n>Our studies demonstrate that lower expectile regression, $lambda$-returns, and critic training on offline data are all crucial for LEQ.
arXiv Detail & Related papers (2024-06-30T13:44:59Z) - Compressing Deep Reinforcement Learning Networks with a Dynamic
Structured Pruning Method for Autonomous Driving [63.155562267383864]
Deep reinforcement learning (DRL) has shown remarkable success in complex autonomous driving scenarios.
DRL models inevitably bring high memory consumption and computation, which hinders their wide deployment in resource-limited autonomous driving devices.
We introduce a novel dynamic structured pruning approach that gradually removes a DRL model's unimportant neurons during the training stage.
arXiv Detail & Related papers (2024-02-07T09:00:30Z) - Action-Quantized Offline Reinforcement Learning for Robotic Skill
Learning [68.16998247593209]
offline reinforcement learning (RL) paradigm provides recipe to convert static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z) - Bootstrapped Transformer for Offline Reinforcement Learning [31.43012728924881]
offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment.
Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem.
We propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data.
arXiv Detail & Related papers (2022-06-17T05:57:47Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - POAR: Efficient Policy Optimization via Online Abstract State
Representation Learning [6.171331561029968]
State Representation Learning (SRL) is proposed to specifically learn to encode task-relevant features from complex sensory data into low-dimensional states.
We introduce a new SRL prior called domain resemblance to leverage expert demonstration to improve SRL interpretations.
We empirically verify POAR to efficiently handle tasks in high dimensions and facilitate training real-life robots directly from scratch.
arXiv Detail & Related papers (2021-09-17T16:52:03Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z) - Reinforcement Learning as One Big Sequence Modeling Problem [84.84564880157149]
Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models.
We view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards.
arXiv Detail & Related papers (2021-06-03T17:58:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.