Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies
- URL: http://arxiv.org/abs/2601.08136v1
- Date: Tue, 13 Jan 2026 01:58:24 GMT
- Title: Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies
- Authors: Zeyang Li, Sunbochen Tang, Navid Azizan,
- Abstract summary: We propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples.<n>By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample.<n>We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class.
- Score: 4.249024052507976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which utilizes a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. Yet, it remains unclear how these objectives relate formally or if they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and enables the principled combination of Q-value and Q-gradient information to derive an optimal, minimum-variance estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL, and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.
Related papers
- Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching [8.665369041430969]
Flow Matching (FM) enables one-step generation, but integrating it into Entropy Reinforcement Learning (MaxEnt RL) is challenging.<n>We propose textbfFlow-based textbfLog-likelihood-textbfAware textbfMaximum textbfEntropy RL (textbfFLAME), a principled framework that addresses these challenges.
arXiv Detail & Related papers (2026-02-02T03:54:11Z) - Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm.<n>QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling.<n>It consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z) - Data-regularized Reinforcement Learning for Diffusion Models at Scale [99.01056178660538]
We introduce Data-regularized Diffusion Reinforcement Learning ( DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution.<n>With over a million GPU hours of experiments and ten thousand double-blind evaluations, we demonstrate that DDRL significantly improves rewards while alleviating the reward hacking seen in RLs.
arXiv Detail & Related papers (2025-12-03T23:45:07Z) - DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning [37.20873499361773]
We propose a unified framework for training masked diffusion large language models (dLLMs) to reason better (furious)<n>We first unify the existing baseline approach by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy.<n>We also propose a new direction of joint training efficient samplers/controllers of dLLMs policy. Via RL, we incentivize dLLMs' natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt.
arXiv Detail & Related papers (2025-10-02T16:57:24Z) - Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models [35.36024202299119]
We introduce textbfAdvantage Weighted Matching (AWM), a policy-gradient method for diffusion.<n>AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining.<n>This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence.
arXiv Detail & Related papers (2025-09-29T17:02:20Z) - DiffusionNFT: Online Diffusion Reinforcement with Forward Process [99.94852379720153]
Diffusion Negative-aware FineTuning (DiffusionNFT) is a new online RL paradigm that optimize diffusion models directly on the forward process via flow matching.<n>DiffusionNFT is up to $25times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free.
arXiv Detail & Related papers (2025-09-19T16:09:33Z) - One-Step Flow Policy Mirror Descent [52.31612487608593]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during flow policy inference.<n>Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z) - Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning [13.163511229897667]
In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions.<n>We propose Diffusion Actor-Critic (DAC) that formulates the Kullback-Leibler (KL) constraint policy iteration as a diffusion noise regression problem.<n>Our approach is evaluated on D4RL benchmarks and outperforms the state-of-the-art in nearly all environments.
arXiv Detail & Related papers (2024-05-31T00:41:04Z) - Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Bootstrap Your Flow [4.374837991804085]
We develop a new flow-based training procedure, FAB (Flow AIS Bootstrap), to produce accurate approximations to complex target distributions.
We demonstrate that FAB can be used to produce accurate approximations to complex target distributions, including Boltzmann distributions, in problems where previous flow-based methods fail.
arXiv Detail & Related papers (2021-11-22T20:11:47Z) - KL Guided Domain Adaptation [88.19298405363452]
Domain adaptation is an important problem and often needed for real-world applications.
A common approach in the domain adaptation literature is to learn a representation of the input that has the same distributions over the source and the target domain.
We show that with a probabilistic representation network, the KL term can be estimated efficiently via minibatch samples.
arXiv Detail & Related papers (2021-06-14T22:24:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.