Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
- URL: http://arxiv.org/abs/2509.25050v1
- Date: Mon, 29 Sep 2025 17:02:20 GMT
- Title: Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
- Authors: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma,
- Abstract summary: We introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion models. AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence.
- Score: 35.36024202299119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objective, the score/flow-matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce \textbf{Advantage Weighted Matching (AWM)}, a policy-gradient method for diffusion models. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
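For intuition, here is a minimal sketch of what a per-sample, advantage-weighted flow-matching update could look like. It assumes a rectified-flow interpolation convention, and the names (awm_loss, model, x0, cond, advantages) are illustrative rather than taken from the authors' code.

```python
import torch

def awm_loss(model, x0, cond, advantages):
    """Sketch of an advantage-weighted flow-matching loss (hypothetical API).

    x0:         clean latents decoded from sampled rollouts, shape (B, ...)
    cond:       conditioning (e.g. text embeddings) for each rollout
    advantages: per-sample advantages, e.g. group-normalized rewards, shape (B,)
    """
    b = x0.shape[0]
    # Draw a timestep and noise per sample, exactly as in pretraining.
    t = torch.rand(b, device=x0.device)
    noise = torch.randn_like(x0)
    t_exp = t.view(b, *([1] * (x0.dim() - 1)))

    # Rectified-flow interpolation and its velocity target (one common convention).
    x_t = (1.0 - t_exp) * x0 + t_exp * noise
    target_v = noise - x0

    # Same per-sample flow-matching loss as pretraining.
    pred_v = model(x_t, t, cond)
    per_sample = ((pred_v - target_v) ** 2).flatten(1).mean(dim=1)

    # Reweight each sample's loss by its (detached) advantage: high-reward
    # rollouts are reinforced, low-reward rollouts are suppressed.
    return (advantages.detach() * per_sample).mean()
```

The point that matches the abstract: the inner objective is unchanged from pretraining; only the per-sample weights differ.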
Related papers
- Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies [4.249024052507976]
We propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class.
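A plausible sketch of this target, inferred from the summary rather than the paper itself: at noise level $t$ the regression target becomes the posterior mean $\mathbb{E}[x_0 \mid x_t]$, and by Tweedie's formula this is equivalent (up to the noise schedule) to estimating the marginal score $\nabla_{x_t}\log p_t(x_t)$, so noise-expectation and gradient-expectation targets appear as two parameterizations of the same posterior quantity.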
arXiv Detail & Related papers (2026-01-13T01:58:24Z) - Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization [44.14678335188207]
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs). Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method.
arXiv Detail & Related papers (2025-10-09T13:59:50Z) - DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning [37.20873499361773]
We propose a unified framework for training masked diffusion large language models (dLLMs) to reason better (furious). We first unify the existing baseline approach by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. We also propose a new direction of jointly training efficient samplers/controllers of the dLLM policy. Via RL, we incentivize dLLMs' natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt.
arXiv Detail & Related papers (2025-10-02T16:57:24Z) - Learning to Reason as Action Abstractions with Scalable Mid-Training RL [55.24192942739207]
An effective mid-training phase should identify a compact set of useful actions and enable fast selection. We propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm.
arXiv Detail & Related papers (2025-09-30T05:34:20Z) - DiffusionNFT: Online Diffusion Reinforcement with Forward Process [99.94852379720153]
Diffusion Negative-aware FineTuning (DiffusionNFT) is a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free.
arXiv Detail & Related papers (2025-09-19T16:09:33Z) - FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
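A plausible reading of this construction, stated as an assumption rather than the paper's exact formulation: the scalar reward $r(x,y)$ defines a target distribution $\pi^*(y\mid x) = \exp(r(x,y)) / Z_\phi(x)$ with a learnable partition function $Z_\phi$, and the policy $\pi_\theta$ is trained to minimize $\mathrm{KL}(\pi_\theta(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x))$ instead of maximizing expected reward.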
arXiv Detail & Related papers (2025-09-18T17:56:36Z) - SPIRE: Conditional Personalization for Federated Diffusion Generative Models [7.8583640700306585]
Shared Backbone Personal Identity Representation Embeddings (SPIRE) is a framework that casts per-client diffusion-based generation as conditional generation in FL. SPIRE factorizes the network into (i) a high-capacity global backbone that learns a population-level score function and (ii) lightweight, learnable client embeddings that encode local data statistics. Our analysis also hints at how client embeddings act as biases that steer a shared score network toward personalized distributions.
arXiv Detail & Related papers (2025-06-14T01:40:31Z) - Flow-GRPO: Training Flow Matching Models via Online RL [75.70017261794422]
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number.
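One standard way to realize such an ODE-to-SDE conversion, sketched here as an assumption about the mechanism rather than the paper's exact equations: a probability-flow ODE $\mathrm{d}x_t = v_\theta(x_t,t)\,\mathrm{d}t$ can be replaced by the SDE $\mathrm{d}x_t = [\,v_\theta(x_t,t) + \tfrac{\sigma_t^2}{2}\nabla_{x_t}\log p_t(x_t)\,]\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t$, which preserves the marginals $p_t$ while injecting the stochasticity that policy-gradient training needs for exploration.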
arXiv Detail & Related papers (2025-05-08T17:58:45Z) - Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. We propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM).
arXiv Detail & Related papers (2025-02-24T08:11:33Z) - Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review [63.31328039424469]
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions.
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z) - Generative Modeling with Flow-Guided Density Ratio Learning [12.192867460641835]
Flow-Guided Density Ratio Learning (FDRL) is a simple and scalable approach to generative modeling.
We show that FDRL can generate images of dimensions as high as $128\times128$, as well as outperform existing gradient flow baselines on quantitative benchmarks.
arXiv Detail & Related papers (2023-03-07T07:55:52Z) - Stable Target Field for Reduced Variance Score Estimation in Diffusion Models [5.9115407007859755]
Diffusion models generate samples by reversing a fixed forward diffusion process.
We argue that a source of the variance in standard score-matching training targets lies in the handling of intermediate noise-variance scales.
We propose to remedy the problem by incorporating a reference batch which we use to calculate weighted conditional scores as more stable training targets.
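A minimal sketch of such a reference-batch target, consistent with the summary but not necessarily the paper's exact estimator: rather than regressing on the single conditional score $\nabla_{x_t}\log p_t(x_t\mid x_0)$, the target is the self-normalized average $\sum_j w_j\,\nabla_{x_t}\log p_t(x_t\mid x_0^{(j)})$ over a reference batch $\{x_0^{(j)}\}$, with weights $w_j \propto p_t(x_t\mid x_0^{(j)})$, which approximates the marginal score with lower variance at intermediate noise scales.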
arXiv Detail & Related papers (2023-02-01T18:57:01Z)