Related papers: Flow Matching Policy Gradients

Flow Matching Policy Gradients

URL: http://arxiv.org/abs/2507.21053v2
Date: Fri, 01 Aug 2025 13:04:28 GMT
Title: Flow Matching Policy Gradients
Authors: David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa,
Abstract summary: Flow Policy Optimization casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss.<n>We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks.<n>We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
Score: 33.011978450108636
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.

Related papers

PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning [6.836651088754774]
PolicyFlow is a novel on-policy CNF-based reinforcement learning algorithm.<n>It integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path.<n>PolicyFlow approximates importance ratios using velocity field variations along a simple path, reducing computational overhead without compromising training stability.
arXiv Detail & Related papers (2026-02-01T11:08:09Z)
One-Step Flow Policy Mirror Descent [38.39095131927252]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during policy inference.<n>Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z)
Decision Flow Policy Optimization [53.825268058199825]
We show that generative models can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces.<n>Previous methods usually adopt the generative models as behavior models to fit state-conditioned action distributions from datasets.<n>We propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization.
arXiv Detail & Related papers (2025-05-26T03:42:20Z)
Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization [14.320131946691268]
We propose an easy-to-use and theoretically sound fine-tuning method for flow-based generative models.<n>By introducing an online rewardweighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold.<n>Our method achieves optimal policy convergence while allowing controllable trade-offs between reward and diversity.
arXiv Detail & Related papers (2025-02-09T22:45:15Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.<n>We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)<n>Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.<n>We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
Thompson sampling for improved exploration in GFlowNets [75.89693358516944]
Generative flow networks (GFlowNets) are amortized variational inference algorithms that treat sampling from a distribution over compositional objects as a sequential decision-making problem with a learnable action policy. We show in two domains that TS-GFN yields improved exploration and thus faster convergence to the target distribution than the off-policy exploration strategies used in past work.
arXiv Detail & Related papers (2023-06-30T14:19:44Z)
Policy Representation via Diffusion Probability Model for Reinforcement Learning [67.56363353547775]
We build a theoretical foundation of policy representation via the diffusion probability model. We present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy. We propose the DIPO which is an implementation for model-free online RL with DIffusion POlicy.
arXiv Detail & Related papers (2023-05-22T15:23:41Z)
Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model the policies in structured action space as the energy-based models (EBM) A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler. Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z)
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.