Flow Matching Policy Gradients
- URL: http://arxiv.org/abs/2507.21053v2
- Date: Fri, 01 Aug 2025 13:04:28 GMT
- Title: Flow Matching Policy Gradients
- Authors: David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa,
- Abstract summary: Flow Policy Optimization casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss.<n>We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks.<n>We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
- Score: 33.011978450108636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
Related papers
- One-Step Flow Policy Mirror Descent [38.39095131927252]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during policy inference.<n>Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z) - Decision Flow Policy Optimization [53.825268058199825]
We show that generative models can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces.<n>Previous methods usually adopt the generative models as behavior models to fit state-conditioned action distributions from datasets.<n>We propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization.
arXiv Detail & Related papers (2025-05-26T03:42:20Z) - Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization [14.320131946691268]
We propose an easy-to-use and theoretically sound fine-tuning method for flow-based generative models.<n>By introducing an online rewardweighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold.<n>Our method achieves optimal policy convergence while allowing controllable trade-offs between reward and diversity.
arXiv Detail & Related papers (2025-02-09T22:45:15Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.<n>We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)<n>Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.<n>We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - Thompson sampling for improved exploration in GFlowNets [75.89693358516944]
Generative flow networks (GFlowNets) are amortized variational inference algorithms that treat sampling from a distribution over compositional objects as a sequential decision-making problem with a learnable action policy.
We show in two domains that TS-GFN yields improved exploration and thus faster convergence to the target distribution than the off-policy exploration strategies used in past work.
arXiv Detail & Related papers (2023-06-30T14:19:44Z) - Policy Representation via Diffusion Probability Model for Reinforcement
Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model.
We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy.
We propose the DIPO which is an implementation for model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z) - Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model the policies in structured action space as the energy-based models (EBM)
A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z) - Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.