DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
- URL: http://arxiv.org/abs/2601.20218v1
- Date: Wed, 28 Jan 2026 03:39:05 GMT
- Title: DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment
- Authors: Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang
- Abstract summary: GRPO-based approaches for text-to-image generation suffer from the sparse reward problem. We introduce \textbf{DenseGRPO}, a novel framework that aligns human preference with dense rewards.
- Score: 49.45064510462232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, creating a mismatch between the global feedback signal and the fine-grained contributions of individual denoising steps. To address this issue, we introduce \textbf{DenseGRPO}, a novel framework that aligns human preference with dense rewards evaluating the fine-grained contribution of each denoising step. Our approach includes two key components: (1) we predict the step-wise reward gain as the dense reward of each denoising step, by applying a reward model to intermediate clean images obtained via an ODE-based approach; this aligns feedback signals with the contributions of individual steps and facilitates effective training. (2) Based on the estimated dense rewards, we reveal a mismatch between the uniform exploration setting of existing GRPO-based methods and the time-varying noise intensity, which yields an inappropriate exploration space. We therefore propose a reward-aware scheme that calibrates the exploration space by adaptively adjusting the timestep-specific stochasticity injected by the SDE sampler, ensuring a suitable exploration space at all timesteps. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of DenseGRPO and highlight the critical role of valid dense rewards in flow matching model alignment.
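To make the two components concrete, here is a minimal PyTorch sketch of the dense-reward idea, assuming the rectified-flow convention x_t = (1 - t) * x0 + t * eps with the model predicting the velocity v = eps - x0 (so x0 = x_t - t * v); `velocity_model`, `reward_model`, and the calibration rule are illustrative stand-ins, not the paper's actual implementation.

```python
import torch

def predict_clean(x_t, t, velocity_model):
    # One-step ODE projection to a clean-image estimate. Assumes the
    # rectified-flow convention x_t = (1 - t) * x0 + t * eps with the
    # model predicting v = eps - x0, so x0 = x_t - t * v.
    return x_t - t * velocity_model(x_t, t)

def dense_rewards(latents, timesteps, velocity_model, reward_model):
    # latents[i] is the state x_t at timesteps[i] (t decreasing toward 0),
    # with num_steps + 1 states for num_steps denoising steps. Each step
    # is credited with the *gain* in reward-model score between the clean
    # estimates before and after it, instead of broadcasting one terminal
    # reward to every step.
    scores = torch.stack([
        reward_model(predict_clean(x_t, t, velocity_model))
        for x_t, t in zip(latents, timesteps)
    ])
    return scores[1:] - scores[:-1]  # step-wise reward gains

def calibrated_noise_scale(base_sigma, gain_magnitude):
    # Hypothetical reward-aware calibration of the SDE sampler's
    # stochasticity at one timestep; both the direction and the
    # functional form are assumptions, not the paper's schedule.
    return base_sigma * (1.0 + gain_magnitude.abs().tanh())
```

The resulting step-wise gains would then replace the shared terminal reward when computing GRPO's group-relative advantages.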
Related papers
- AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models.
Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies.
We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
arXiv Detail & Related papers (2026-02-06T16:09:50Z)
- Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics [49.242224984144904]
We propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics.
Our key insight is to formulate the sampling process as a theoretically principled algorithm that explicitly incorporates the gradient of a Process Reward Model.
We derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model.
arXiv Detail & Related papers (2026-02-04T08:59:57Z)
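As a rough illustration of reward-gradient-guided dynamics (the actual Euphonium update is derived in the paper; `velocity_model` and `process_reward` are assumed names):

```python
import torch

def reward_guided_step(x_t, t, dt, velocity_model, process_reward, scale=0.1):
    # One sampler step that adds the gradient of a Process Reward Model
    # to the flow-matching drift. Purely illustrative of the idea.
    x = x_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(process_reward(x, t).sum(), x)[0]
    return x_t + dt * velocity_model(x_t, t) + scale * grad
```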
- E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models [30.505448172476402]
We propose E-GRPO, an entropy-aware Group Relative Policy Optimization that increases the entropy of SDE sampling steps.
Building on this, we introduce a multi-step group normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step.
arXiv Detail & Related papers (2026-01-01T18:27:32Z)
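A minimal sketch of what a multi-step group normalized advantage could look like, assuming a rewards tensor indexed by (sample in group, consolidated SDE step); shapes and names are assumptions:

```python
import torch

def multi_step_group_advantage(rewards, eps=1e-6):
    # rewards[g, k]: reward credited to group member g at consolidated
    # SDE denoising step k. Advantages are normalized across the group
    # separately at each step, in the spirit of GRPO's group baseline.
    mean = rewards.mean(dim=0, keepdim=True)
    std = rewards.std(dim=0, keepdim=True)
    return (rewards - mean) / (std + eps)
```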
We propose a novel framework that mitigates Preference Mode Collapse (PMC).
D$^2$-Align achieves superior alignment with human preference.
arXiv Detail & Related papers (2025-12-30T11:17:52Z)
- G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions.
We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales.
Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
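One plausible reading of Multi-Granularity Advantage Integration is a weighted fusion of advantages computed at several diffusion scales; the sketch below is an assumption about that aggregation, not the paper's exact module:

```python
def integrate_advantages(per_scale_advantages, weights=None):
    # per_scale_advantages: list of advantage tensors, one per diffusion
    # granularity, each shaped (group_size,). A weighted sum fuses them
    # into a single advantage per sample.
    n = len(per_scale_advantages)
    weights = weights or [1.0 / n] * n
    return sum(w * a for w, a in zip(weights, per_scale_advantages))
```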
- Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models.
Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement.
We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
- $\Psi$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models [26.211711150915203]
Inference-time reward alignment with score-based generative models has gained significant traction.
$\Psi$-Sampler is an SMC-based framework incorporating pCNL-based initial particle sampling.
arXiv Detail & Related papers (2025-06-02T05:02:33Z)
- Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design [87.58981407469977]
We propose a novel framework for inference-time reward optimization with diffusion models inspired by evolutionary algorithms.
Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising.
arXiv Detail & Related papers (2025-02-20T17:48:45Z)
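The noising / reward-guided-denoising loop can be caricatured as an evolutionary select-the-best scheme; this sketch replaces the paper's guided denoising with simple best-of-population selection, and all names are illustrative:

```python
import torch

def iterative_refine(x, denoise, reward, noise_level=0.5, iters=10, pop=8):
    # Each iteration: partially re-noise the current design, denoise a
    # small population of candidates, keep the highest-reward one.
    for _ in range(iters):
        candidates = [denoise(x + noise_level * torch.randn_like(x))
                      for _ in range(pop)]
        scores = torch.stack([reward(c) for c in candidates])
        x = candidates[int(scores.argmax())]
    return x
```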
- Aligning Few-Step Diffusion Models with Dense Reward Difference Learning [81.85515625591884]
Stepwise Diffusion Policy Optimization (SDPO) is an alignment method tailored for few-step diffusion models.
SDPO incorporates dense reward feedback at every intermediate step to ensure consistent alignment across all denoising steps.
SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations.
arXiv Detail & Related papers (2024-11-18T16:57:41Z)
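A hedged guess at what dense reward difference learning might look like in code: a DPO-style per-step objective where each denoising step contributes its own preference term weighted by that step's reward difference. The shapes and the weighting rule are assumptions, not SDPO's actual loss:

```python
import torch
import torch.nn.functional as F

def stepwise_preference_loss(logp_w, logp_l, r_w, r_l, beta=1.0):
    # logp_*: per-step log-probs of the preferred / dispreferred
    # trajectories under the policy, shape (num_steps,).
    # r_*: per-step dense rewards for the same trajectories.
    weights = torch.sigmoid(r_w - r_l)    # step-level preference strength
    margins = beta * (logp_w - logp_l)    # per-step policy margin
    return -(weights * F.logsigmoid(margins)).mean()
```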
- LIRE: listwise reward enhancement for preference alignment [27.50204023448716]
We propose a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise framework.
LIRE is straightforward to implement, requiring minimal parameter tuning, and seamlessly aligns with the pairwise paradigm.
Our experiments demonstrate that LIRE consistently outperforms existing methods across several benchmarks on dialogue and summarization tasks.
arXiv Detail & Related papers (2024-05-22T10:21:50Z)
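LIRE's listwise idea can be sketched as maximizing the expected offline reward under the policy's normalized distribution over the K candidate responses (a toy version; the paper's exact objective may differ):

```python
import torch

def lire_style_loss(logps, rewards):
    # logps: policy log-probabilities of K candidate responses, shape (K,).
    # rewards: their offline rewards, shape (K,). Softmax turns the list
    # into a distribution; the loss pushes mass toward high-reward items.
    probs = torch.softmax(logps, dim=0)
    return -(probs * rewards).sum()
```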