G$^2$RPO: Granular GRPO for Precise Reward in Flow Models
- URL: http://arxiv.org/abs/2510.01982v2
- Date: Fri, 10 Oct 2025 08:40:51 GMT
- Title: G$^2$RPO: Granular GRPO for Precise Reward in Flow Models
- Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
- Abstract summary: We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions. We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales. Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
- Score: 74.21206048155669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G$^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
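To make the abstract's two components concrete, here is a minimal, self-contained sketch in plain NumPy. The toy velocity field, toy reward, step counts, injection time, and the specific granularities are illustrative assumptions, not the authors' implementation: one SDE perturbation is injected at a single denoising step while the rest of the trajectory stays deterministic, and GRPO-style group-relative advantages are averaged over several denoising granularities.

```python
# Hedged sketch only: all functions and hyperparameters below are toy
# stand-ins, not the G^2RPO authors' code.
import numpy as np

rng = np.random.default_rng(0)

def toy_velocity(x, t):
    """Stand-in for a learned flow-matching velocity field v_theta(x, t)."""
    return -x  # toy drift pulling samples toward the origin

def toy_reward(x):
    """Stand-in for a preference/reward model scoring a decoded sample."""
    return -np.sum((x - 1.0) ** 2)

def ode_step(x, t, h):
    """Deterministic Euler step of the probability-flow ODE."""
    return x + toy_velocity(x, t) * h

def sde_step(x, t, h, noise, sigma=0.5):
    """Euler-Maruyama step; `noise` is the single injected SDE perturbation."""
    return x + toy_velocity(x, t) * h + sigma * np.sqrt(h) * noise

def rollout(x0, noise, T=10, t_inject=0.7, granularity=1):
    """Denoise from t=1 to t=0. One SDE perturbation is injected at the step
    crossing `t_inject` (singular stochastic sampling); every other step is a
    deterministic ODE step, coarsened by `granularity`."""
    h_nominal = granularity / T
    x, t, injected = x0.copy(), 1.0, False
    while t > 1e-8:
        h = min(h_nominal, t)
        if not injected and t - h <= t_inject:
            x, injected = sde_step(x, t, h, noise), True
        else:
            x = ode_step(x, t, h)
        t -= h
    return x

# A group of G candidate noises, all injected at the same denoising step.
x0 = rng.standard_normal(4)
G, granularities = 8, (1, 2, 4)
noises = [rng.standard_normal(4) for _ in range(G)]

# Multi-granularity advantage integration: compute GRPO-style group-relative
# advantages at each granularity, then average them across granularities.
advantages = np.zeros(G)
for g in granularities:
    rewards = np.array([toy_reward(rollout(x0, n, granularity=g)) for n in noises])
    advantages += (rewards - rewards.mean()) / (rewards.std() + 1e-8)
advantages /= len(granularities)
print("per-noise advantages:", np.round(advantages, 3))
```

In the full method these aggregated advantages would weight a policy-gradient update of the flow model; the sketch only shows how a single-step perturbation can be scored against its group at multiple denoising scales.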
Related papers
- SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization. Experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
arXiv Detail & Related papers (2026-02-01T07:13:20Z) - DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment [49.45064510462232]
GRPO-based approaches for text-to-image generation suffer from the sparse reward problem. We introduce DenseGRPO, a novel framework that aligns human preference with dense rewards.
arXiv Detail & Related papers (2026-01-28T03:39:05Z) - Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation [6.597818816347323]
Direct Preference Optimization aims to align generative outputs with human preferences by distinguishing between chosen and rejected samples. A critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training. We introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR). Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations.
arXiv Detail & Related papers (2025-11-24T12:37:49Z) - Understanding Sampler Stochasticity in Training Diffusion Models for RLHF [11.537564997052606]
This paper theoretically characterizes the reward gap and provides non-vacuous bounds for general diffusion models. Empirically, large-scale experiments on text-to-image models validate that reward gaps consistently narrow over training.
arXiv Detail & Related papers (2025-10-12T19:08:38Z) - Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching [6.238027696245818]
Reinforcement Learning (RL) has emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models. Our investigation reveals a significant drawback of this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts.
arXiv Detail & Related papers (2025-09-07T07:25:00Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering [18.543769006014383]
Diffusion models often exhibit inconsistent sample quality due to variations inherent in their sampling trajectories. We introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments.
arXiv Detail & Related papers (2025-05-29T11:08:24Z) - Prior-Guided Diffusion Planning for Offline Reinforcement Learning [4.760537994346813]
Prior Guidance (PG) is a novel guided sampling framework that replaces the standard Gaussian prior of a behavior-cloned diffusion model. PG directly generates high-value trajectories without costly reward optimization of the diffusion model itself. We present an efficient training strategy that applies behavior regularization in latent space, and empirically demonstrate that PG outperforms state-of-the-art diffusion policies and planners across diverse long-horizon offline RL benchmarks.
arXiv Detail & Related papers (2025-05-16T05:39:02Z) - Inference-Time Alignment of Diffusion Models with Direct Noise Optimization [45.77751895345154]
We propose a novel alignment approach, named Direct Noise Optimization (DNO), that optimizes the injected noise during the sampling process of diffusion models (a toy sketch of this noise-optimization loop appears after the related-papers list below).
By design, DNO operates at inference time, and is thus tuning-free and prompt-agnostic, with the alignment occurring in an online fashion during generation.
We conduct extensive experiments on several important reward functions and demonstrate that the proposed DNO approach can achieve state-of-the-art reward scores within a reasonable time budget for generation.
arXiv Detail & Related papers (2024-05-29T08:39:39Z) - ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
arXiv Detail & Related papers (2024-04-05T13:58:51Z) - Improved off-policy training of diffusion samplers [93.66433483772055]
We study the problem of training diffusion models to sample from a distribution with an unnormalized density or energy function. We benchmark several diffusion-structured inference methods, including simulation-based variational approaches and off-policy methods. Our results shed light on the relative advantages of existing algorithms while bringing into question some claims from past work.
arXiv Detail & Related papers (2024-02-07T18:51:49Z) - Ensemble Kalman Filtering Meets Gaussian Process SSM for Non-Mean-Field and Online Inference [47.460898983429374]
We introduce an ensemble Kalman filter (EnKF) into the non-mean-field (NMF) variational inference framework to approximate the posterior distribution of the latent states.
This novel marriage between EnKF and GPSSM not only eliminates the need for extensive parameterization in learning variational distributions, but also enables an interpretable, closed-form approximation of the evidence lower bound (ELBO).
We demonstrate that the resulting EnKF-aided online algorithm embodies a principled objective function by ensuring data-fitting accuracy while incorporating model regularizations to mitigate overfitting.
arXiv Detail & Related papers (2023-12-10T15:22:30Z) - Optimal Budgeted Rejection Sampling for Generative Models [54.050498411883495]
Rejection sampling methods have been proposed to improve the performance of discriminator-based generative models.
We first propose an Optimal Budgeted Rejection Sampling scheme that is provably optimal.
Second, we propose an end-to-end method that incorporates the sampling scheme into the training procedure to further enhance the model's overall performance.
arXiv Detail & Related papers (2023-11-01T11:52:41Z)