Latent Policy Steering through One-Step Flow Policies
- URL: http://arxiv.org/abs/2603.05296v1
- Date: Thu, 05 Mar 2026 15:38:08 GMT
- Title: Latent Policy Steering through One-Step Flow Policies
- Authors: Hokyun Im, Andrey Kolobov, Jianlong Fu, Youngwoon Lee
- Abstract summary: Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Latent Policy Steering (LPS) enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
- Score: 34.06099184809882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
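The core update can be pictured as follows: a frozen one-step decoder maps latent actions to environment actions, and the critic's gradient flows through it into the latent actor. The sketch below is a minimal PyTorch illustration in that spirit, not the authors' code; the MLP stand-ins for the pretrained MeanFlow decoder, critic, and latent actor, and all dimensions, are assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, LATENT_DIM, ACTION_DIM = 17, 8, 6   # illustrative sizes, not from the paper

mean_flow = nn.Sequential(nn.Linear(STATE_DIM + LATENT_DIM, 256), nn.ReLU(),
                          nn.Linear(256, ACTION_DIM))     # one-step decoder: (s, z) -> a
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
                       nn.Linear(256, 1))                 # original-action-space Q(s, a)
latent_actor = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                             nn.Linear(256, LATENT_DIM))  # latent-action-space actor: s -> z

# Freeze the generative prior (pretrained in practice); only the latent actor moves here.
for p in mean_flow.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(latent_actor.parameters(), lr=3e-4)
states = torch.randn(32, STATE_DIM)                       # stand-in for a dataset batch

z = latent_actor(states)                                  # propose latent actions
actions = mean_flow(torch.cat([states, z], dim=-1))       # decode to the original action space
loss = -critic(torch.cat([states, actions], dim=-1)).mean()  # maximize Q, no proxy latent critic
opt.zero_grad()
loss.backward()                                           # Q-gradients flow through the frozen
opt.step()                                                # decoder into the latent actor
```

Because the decoder is frozen and was trained on dataset actions, the latent actor can only steer within the behaviors the generative prior can produce, which is what keeps the optimization on-support.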
Related papers
- ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation [20.162114513881118]
Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset generated by behavior policies without additional environment interactions. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction.
arXiv Detail & Related papers (2026-02-04T21:03:11Z)
- Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z)
- Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning [10.037416068775853]
We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state- and pixel-based tasks.
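The "weighted behavior cloning" step can be sketched as an advantage-weighted loss; the exponential weighting with a temperature (an AWR-style choice) and the weight cap below are assumptions for illustration, not details taken from the paper.

```python
import torch

def weighted_bc_loss(per_sample_fm_loss, advantages, temperature=1.0):
    """Advantage-weighted behavior cloning: upweight high-value dataset actions.

    per_sample_fm_loss: flow-matching loss per sample, shape (B,)
    advantages: critic-estimated advantages of dataset actions, shape (B,)
    """
    weights = torch.exp(advantages / temperature).clamp(max=100.0)  # cap to avoid blow-up
    return (weights.detach() * per_sample_fm_loss).mean()           # weights carry no gradient
```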
arXiv Detail & Related papers (2025-12-03T17:05:58Z)
- Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning [64.6334337560557]
Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task. Decision Transformer (DT) struggles to reliably align the actual achieved returns with specified target returns. We propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL.
arXiv Detail & Related papers (2025-08-22T14:30:53Z)
- EXPO: Stable Reinforcement Learning with Expressive Policies [74.30151915786233]
We propose a sample-efficient online reinforcement learning algorithm to maximize value with two parameterized policies. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods.
arXiv Detail & Related papers (2025-07-10T17:57:46Z)
- Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning [0.0]
We propose a generative policy (SSCP) trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability. We extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference.
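A rough sketch of the single-step-completion idea, assuming a linear flow path: from an intermediate sample x_t the model regresses a vector pointing toward the data endpoint, so an action can be produced in one network evaluation. The model signature and the path parameterization are assumptions, not the paper's exact objective.

```python
import torch

def completion_loss(model, states, actions):
    """Regress the straight-line direction from an intermediate flow sample to the data."""
    x0 = torch.randn_like(actions)            # noise endpoint of the flow path
    t = torch.rand(actions.shape[0], 1)       # random interpolation times in [0, 1)
    x_t = (1 - t) * x0 + t * actions          # intermediate sample on a linear path
    target = actions - x0                     # on this path, x1 = x_t + (1 - t) * target
    return ((model(states, x_t, t) - target) ** 2).mean()

def one_step_action(model, states, action_dim):
    """Jump from pure noise (t = 0) to an action in a single evaluation."""
    x0 = torch.randn(states.shape[0], action_dim)
    t0 = torch.zeros(states.shape[0], 1)
    return x0 + model(states, x0, t0)         # at t = 0 the full jump is one prediction
```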
arXiv Detail & Related papers (2025-06-26T16:09:53Z)
- Improving TD3-BC: Relaxed Policy Constraint for Offline Learning and Stable Online Fine-Tuning [7.462336024223669]
A key challenge is overcoming overestimation bias for actions not present in the data.
One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC).
We demonstrate that by continuing to train a policy offline while reducing the influence of the BC component, we can produce refined policies.
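This can be sketched as a TD3+BC-style policy loss whose BC weight decays during training; the linear schedule and the Q normalization below are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def policy_loss(actor, critic, states, dataset_actions, step, total_steps,
                bc_weight_start=1.0, bc_weight_end=0.1):
    pi = actor(states)
    q = critic(states, pi)
    lam = 1.0 / q.abs().mean().detach()             # TD3+BC-style Q-scale normalization
    frac = min(step / total_steps, 1.0)
    bc_weight = bc_weight_start + frac * (bc_weight_end - bc_weight_start)  # linear decay
    bc_term = ((pi - dataset_actions) ** 2).mean()  # behavioural-cloning constraint
    return -(lam * q).mean() + bc_weight * bc_term  # shrinking BC influence relaxes the constraint
```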
arXiv Detail & Related papers (2022-11-21T19:10:27Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
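A sketch of return-based resampling under these constraints might look like the following; the min-shift that keeps every episode at nonzero probability (and hence the support unchanged) is an assumption about how the weighting is made well-defined.

```python
import numpy as np

def resample_episode_indices(episode_returns, num_samples, seed=0):
    """Sample episode indices with probability proportional to (shifted) episode return."""
    rng = np.random.default_rng(seed)
    r = np.asarray(episode_returns, dtype=np.float64)
    weights = r - r.min() + 1e-6            # shift so every episode keeps nonzero mass
    probs = weights / weights.sum()         # high-return episodes are drawn more often
    return rng.choice(len(r), size=num_samples, p=probs, replace=True)
```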
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
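A minimal sketch of such a density-based support penalty: maximize Q while penalizing actions whose estimated behavior log-density is low. Approximating the log-density with a pretrained generative model (e.g. a VAE lower bound) is a common choice assumed here, not a claim about SPOT's exact implementation.

```python
import torch

def supported_policy_loss(actor, critic, behavior_log_density, states, lam=0.1):
    """Maximize Q while penalizing actions with low estimated behavior density."""
    pi = actor(states)
    q = critic(states, pi)
    log_p = behavior_log_density(states, pi)  # hypothetical estimator, e.g. a VAE ELBO
    return (-q - lam * log_p).mean()          # low-density (off-support) actions are penalized
```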
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
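A rough sketch of iterative (as opposed to direct) amortization: a learned update network repeatedly refines the policy's parameters using gradient information, rather than predicting them in one shot. The updater's inputs and the use of a Q-gradient here are assumptions for illustration; the paper works with a variational objective.

```python
import torch

def iterative_policy_mean(updater, critic, states, action_dim, num_iters=5):
    """Refine a policy mean over several steps with a learned update network."""
    mu = torch.zeros(states.shape[0], action_dim)       # initial estimate
    for _ in range(num_iters):
        mu = mu.detach().requires_grad_(True)
        value = critic(states, mu).sum()
        grad = torch.autograd.grad(value, mu)[0]        # value-improvement direction
        mu = mu + updater(torch.cat([states, mu, grad], dim=-1))  # learned refinement step
    return mu
```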
arXiv Detail & Related papers (2020-10-20T23:25:42Z)