GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies
- URL: http://arxiv.org/abs/2512.02581v1
- Date: Tue, 02 Dec 2025 09:49:26 GMT
- Title: GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies
- Authors: Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An
- Abstract summary: GoRL is a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines.
- Score: 16.859964356466676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
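To make the decoupling concrete, the sketch below pairs a tractable Gaussian latent policy with a conditional generative decoder under a two-timescale schedule, as the abstract describes. It is a minimal illustration, not the authors' implementation: the network sizes, the SAC-style latent objective, the placeholder critic and batch, and the decoder's slow-timescale update rule are all assumptions for demonstration.

```python
# Illustrative sketch (not the authors' code): a tractable Gaussian policy over a
# latent variable z, decoded into actions by a conditional generative decoder.
# Architectures, the SAC-style objective, and the update ratio are assumptions.
import torch
import torch.nn as nn


class LatentPolicy(nn.Module):
    """Gaussian policy over latents, so log-probabilities stay tractable."""

    def __init__(self, obs_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )

    def forward(self, obs):
        mu, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())
        z = dist.rsample()                       # reparameterized latent sample
        return z, dist.log_prob(z).sum(-1)       # likelihood lives in latent space


class GenerativeDecoder(nn.Module):
    """Conditional decoder mapping (obs, z) to an action; stands in for a
    flow/diffusion decoder whose action likelihood need not be tractable."""

    def __init__(self, obs_dim, latent_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))


obs_dim, act_dim, latent_dim = 11, 3, 8          # e.g. Hopper-like dimensions
policy = LatentPolicy(obs_dim, latent_dim)
decoder = GenerativeDecoder(obs_dim, latent_dim, act_dim)

# Placeholder critic: frozen here; a real agent would learn it from Bellman targets.
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
for p in critic.parameters():
    p.requires_grad_(False)

opt_policy = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_decoder = torch.optim.Adam(decoder.parameters(), lr=3e-4)
decoder_update_every = 4                         # two-timescale schedule (assumed ratio)

for step in range(1_000):
    obs = torch.randn(64, obs_dim)               # placeholder batch; real code uses env/replay data

    # Fast timescale: optimize the tractable latent policy against the critic.
    z, log_prob = policy(obs)
    q = critic(torch.cat([obs, decoder(obs, z)], dim=-1)).squeeze(-1)
    policy_loss = (0.2 * log_prob - q).mean()    # entropy-regularized, SAC-like objective
    opt_policy.zero_grad()
    policy_loss.backward()
    opt_policy.step()

    # Slow timescale: refresh the decoder so it keeps proposing high-value actions.
    if step % decoder_update_every == 0:
        with torch.no_grad():
            z_fixed, _ = policy(obs)
        q_dec = critic(torch.cat([obs, decoder(obs, z_fixed)], dim=-1))
        decoder_loss = -q_dec.mean()             # stand-in for the paper's generative decoder update
        opt_decoder.zero_grad()
        decoder_loss.backward()
        opt_decoder.step()
```

The structural point illustrated here is that the optimized log-probability lives entirely in latent space, so no action likelihood is ever required; in the actual framework the decoder would be a conditional generative model (e.g. flow matching or diffusion) rather than this one-step MLP.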
Related papers
- Dichotomous Diffusion Policy Optimization [46.51375996317989]
DIPOLE is a novel RL algorithm designed for stable and controllable diffusion policy optimization. We also use DIPOLE to train a large vision-language-action model for end-to-end autonomous driving.
arXiv Detail & Related papers (2025-12-31T16:56:56Z) - Offline Reinforcement Learning with Generative Trajectory Policies [6.501269050121785]
Generative models have emerged as a powerful class of policies for offline reinforcement learning. Existing methods face a stark trade-off: slow, iterative models are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. We propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE.
arXiv Detail & Related papers (2025-10-13T15:06:28Z) - Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards [47.557539197058496]
We introduce Random Policy Valuation for Diverse Reasoning (ROVER), a minimalist yet highly effective RL method that samples actions from a softmax over uniform-policy Q-values. It demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%).
arXiv Detail & Related papers (2025-09-29T16:09:07Z) - EXPO: Stable Reinforcement Learning with Expressive Policies [74.30151915786233]
We propose a sample-efficient online reinforcement learning algorithm to maximize value with two parameterized policies. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods.
arXiv Detail & Related papers (2025-07-10T17:57:46Z) - Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning [0.0]
We propose a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability. We extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference.
arXiv Detail & Related papers (2025-06-26T16:09:53Z) - GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning [34.25769740497309]
GenPO is a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
arXiv Detail & Related papers (2025-05-24T15:57:07Z) - Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone [72.17534881026995]
We develop an offline and online fine-tuning approach called policy-agnostic RL (PA-RL). We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm.
arXiv Detail & Related papers (2024-12-09T17:28:03Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)