Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
- URL: http://arxiv.org/abs/2602.01606v1
- Date: Mon, 02 Feb 2026 03:54:11 GMT
- Title: Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
- Authors: Zeqiao Li, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo,
- Abstract summary: Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges.
- Score: 8.665369041430969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects bias, enabling efficient exploration and bringing the policy closer to the optimal MaxEnt policy. Third, we integrate the MeanFlow formulation to achieve expressive and efficient one-step control. Empirical results on MuJoCo show that FLAME outperforms Gaussian baselines and matches multi-step diffusion policies at significantly lower inference cost. Code is available at https://github.com/lzqw/FLAME.
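As a rough illustration of the first ingredient, here is a minimal sketch (not the authors' code) of what a Q-reweighted conditional flow-matching loss could look like: self-normalized softmax weights over the batch stand in for exp(Q/alpha)/Z, so the partition function never has to be estimated. The names `vel_net`, `q_net`, and `alpha` are our assumptions.

```python
import torch

def q_reweighted_fm_loss(vel_net, q_net, state, action, alpha=0.2):
    """state: (B, ds); action: (B, da) actions drawn from the replay buffer."""
    b = action.shape[0]
    t = torch.rand(b, 1)                        # flow time in [0, 1]
    noise = torch.randn_like(action)            # base sample x_0 ~ N(0, I)
    x_t = (1 - t) * noise + t * action          # linear interpolation path
    target_v = action - noise                   # straight-line target velocity
    per_sample = ((vel_net(state, x_t, t) - target_v) ** 2).mean(dim=1)
    with torch.no_grad():                       # self-normalized weights:
        w = torch.softmax(q_net(state, action).squeeze(-1) / alpha, dim=0)
    return (w * per_sample).sum()               # softmax normalization absorbs Z
```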
Related papers
- FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching [28.98935867615678]
We propose a framework that regulates the policy by penalizing the kinetic energy of the velocity field. We derive an energy-regularized policy scheme and a practical off-policy algorithm that automatically tunes the kinetic energy penalty.
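A hedged sketch of the kinetic-energy idea: add a penalty proportional to the expected squared norm of the velocity field to the usual matching loss. The coefficient `lam` and all function names below are assumptions; per the summary, the actual algorithm tunes this term automatically.

```python
import torch

def kinetic_energy(vel_net, state, x_t, t):
    # E[||v_theta(s, x_t, t)||^2], the kinetic energy of the velocity field
    v = vel_net(state, x_t, t)
    return (v ** 2).sum(dim=1).mean()

def regularized_loss(matching_loss, vel_net, state, x_t, t, lam=0.1):
    # total objective: matching term plus kinetic-energy penalty
    return matching_loss + lam * kinetic_energy(vel_net, state, x_t, t)
```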
arXiv Detail & Related papers (2026-02-13T11:32:10Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled divergence constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
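For contrast, a schematic of the two surrogates: PPO's clipped objective versus a divergence-penalized one. The exact divergence and coefficient DPPO uses are not given in the summary, so the KL penalty (via the standard k3 estimator) and `beta` below are assumptions.

```python
import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    # PPO: clip the importance ratio to keep updates near the old policy
    ratio = torch.exp(logp - logp_old)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

def divergence_penalty_loss(logp, logp_old, adv, beta=0.05):
    # replace clipping with an explicit KL penalty (k3 estimator of KL(old||new))
    ratio = torch.exp(logp - logp_old)
    kl = (ratio - 1.0) - (logp - logp_old)
    return -(ratio * adv).mean() + beta * kl.mean()
```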
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies [4.249024052507976]
We propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class.
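As a toy illustration of posterior-mean estimation: given an intermediate noisy sample x_t, one can estimate E[x_1 | x_t] by reweighting candidate targets. The linear-Gaussian path and exp(Q/alpha) weighting below are our assumptions, not the paper's construction.

```python
import torch

def posterior_mean_target(q_vals, candidates, x_t, t, alpha=0.2):
    """q_vals: (K,) scores for K candidate targets; candidates: (K, d);
    x_t: (d,) noisy intermediate on a path x_t = t*x1 + (1-t)*eps, t in (0, 1)."""
    resid = x_t.unsqueeze(0) - t * candidates           # implied noise * (1 - t)
    log_lik = -(resid ** 2).sum(dim=1) / (2 * (1 - t) ** 2)
    w = torch.softmax(q_vals / alpha + log_lik, dim=0)  # normalization kills Z
    return w.unsqueeze(1).mul(candidates).sum(dim=0)    # estimate of E[x1 | x_t]
```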
arXiv Detail & Related papers (2026-01-13T01:58:24Z) - A Diffusion Model Framework for Maximum Entropy Reinforcement Learning [32.26181994745642]
We present a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way.<n>We find that DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
arXiv Detail & Related papers (2025-12-01T18:59:58Z) - Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models [53.339700196282905]
A key challenge in applying reinforcement learning to diffusion large language models (dLLMs) is the intractability of their likelihood functions. We propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
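Schematically (our reading, not BGPO's exact bound): when exact log-likelihoods are intractable, one substitutes a differentiable lower bound `elbo(x, y) <= log pi(y|x)` into the policy-gradient surrogate.

```python
import torch

def lower_bound_pg_loss(elbo, advantage):
    """elbo: (B,) differentiable ELBO estimates of log pi(y|x); advantage: (B,).
    Substitutes the tractable bound for the intractable log-likelihood."""
    return -(advantage * elbo).mean()
```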
arXiv Detail & Related papers (2025-10-13T17:47:50Z) - FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
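A sketch of the stated recipe under our own notational assumptions: define a target log-density r/beta - log Z with a learnable log-partition, and drive the policy toward it. The squared-discrepancy surrogate below is one simple choice; it reaches zero exactly when the policy matches the target density on the sampled outputs.

```python
import torch

log_z = torch.zeros(1, requires_grad=True)  # learnable log-partition function

def flowrl_style_loss(logp_policy, reward, beta=1.0):
    # target: log p*(y|x) = r(x, y)/beta - log Z
    log_target = reward / beta - log_z
    return (logp_policy - log_target).pow(2).mean()
```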
arXiv Detail & Related papers (2025-09-18T17:56:36Z) - One-Step Flow Policy Mirror Descent [52.31612487608593]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
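The 1-step sampler itself is just a single Euler step from noise, which is exact when the learned paths are straight. A minimal sketch, with `vel_net` and `act_dim` assumed:

```python
import torch

@torch.no_grad()
def one_step_action(vel_net, state, act_dim):
    # single Euler step over [0, 1]: x1 = x0 + 1 * v(s, x0, t=0)
    x0 = torch.randn(state.shape[0], act_dim)
    t0 = torch.zeros(state.shape[0], 1)
    return x0 + vel_net(state, x0, t0)
```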
arXiv Detail & Related papers (2025-07-31T15:51:10Z) - Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
Reinforcement learning with outcome-based feedback faces a fundamental challenge: how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z) - DIME: Diffusion-Based Maximum Entropy Reinforcement Learning [38.17326719163195]
We propose Diffusion-Based Maximum Entropy RL (DIME), which leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL.
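Schematically (our paraphrase, not DIME's derivation): the intractable policy entropy in the MaxEnt actor objective is replaced by a tractable lower bound computed along the diffusion chain.

```python
import torch

def maxent_actor_loss(q_net, state, action, entropy_lb, alpha=0.2):
    """entropy_lb: scalar lower bound L <= H(pi(.|s)), e.g. from an ELBO.
    Maximizes E[Q(s, a)] + alpha * L, a lower bound on the MaxEnt objective."""
    return -(q_net(state, action).mean() + alpha * entropy_lb)
```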
arXiv Detail & Related papers (2025-02-04T13:37:14Z) - Sampling from Energy-based Policies using Diffusion [18.135501150108894]
Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning. Existing methods typically use simpler parametric distributions, like Gaussians, for policy representation. We introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function.
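For intuition, here is a generic Langevin sampler targeting pi(a|s) proportional to exp(Q(s,a)/alpha), i.e. energy E = -Q/alpha. The paper's actual sampler is diffusion-based, so treat this only as a baseline sketch with assumed names.

```python
import torch

def langevin_sample(q_net, state, act_dim, steps=50, step_size=1e-2, alpha=0.2):
    # unadjusted Langevin dynamics on energy E(a) = -Q(s, a) / alpha
    a = torch.randn(state.shape[0], act_dim, requires_grad=True)
    for _ in range(steps):
        energy = -q_net(state, a).sum() / alpha
        grad, = torch.autograd.grad(energy, a)
        with torch.no_grad():
            a = a - step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(a)
        a.requires_grad_(True)
    return a.detach()
```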
arXiv Detail & Related papers (2024-10-02T08:09:33Z) - Discrete Probabilistic Inference as Control in Multi-path Environments [84.67055173040107]
We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem.
We show that GFlowNets learn a policy that samples objects proportionally to their reward by enforcing a conservation of flows.
We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward.
arXiv Detail & Related papers (2024-02-15T20:20:35Z)
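For the GFlowNet equivalence noted in the previous entry, here is an illustrative trajectory-balance loss, one common GFlowNet objective (notation ours, not the paper's): when log Z + sum log P_F = log R + sum log P_B holds on every trajectory, objects are sampled proportionally to their reward.

```python
import torch

def trajectory_balance_loss(log_z, log_pf, log_pb, log_reward):
    """log_pf, log_pb: (B, T) per-step forward/backward log-probs;
    log_reward: (B,); log_z: learnable scalar log-partition."""
    lhs = log_z + log_pf.sum(dim=1)
    rhs = log_reward + log_pb.sum(dim=1)
    return ((lhs - rhs) ** 2).mean()
```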