E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
- URL: http://arxiv.org/abs/2601.00423v1
- Date: Thu, 01 Jan 2026 18:27:32 GMT
- Title: E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
- Authors: Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
- Abstract summary: We propose E-GRPO, an entropy-aware Group Relative Policy Optimization that increases the entropy of SDE sampling steps. Building upon this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step.
- Score: 30.505448172476402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent reinforcement learning methods have enhanced flow matching models on human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps result in indistinguishable roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization that increases the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals due to stochasticity accumulated over multiple steps, we merge consecutive low-entropy steps into one high-entropy step for SDE sampling, while applying ODE sampling on the other steps. Building upon this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results on different reward settings demonstrate the effectiveness of our method.
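As a concrete illustration of the group-relative advantage underlying E-GRPO, the sketch below standardizes rewards within a group of roll-outs that branch from the same consolidated SDE step; the function and variable names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed names, not the authors' implementation) of the
# multi-step group-normalized advantage described in the abstract: roll-outs
# that share the same consolidated SDE denoising step form one group, and
# each roll-out's advantage is its reward standardized within that group.
import numpy as np

def group_normalized_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one group of 8 roll-outs branched at the same high-entropy SDE step.
rewards = np.array([0.71, 0.64, 0.83, 0.55, 0.77, 0.60, 0.90, 0.68])
print(group_normalized_advantage(rewards))
```

Roll-outs scoring above the group mean receive positive advantages and are reinforced; merging low-entropy steps is what keeps the roll-outs within a group distinguishable, so these signs stay informative.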
Related papers
- Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages [6.470160796651034]
We propose a novel framework for training flow-matching text-to-image models into efficient few-step generators well aligned with human preferences. We show that TAFS GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences.
arXiv Detail & Related papers (2026-02-02T03:32:00Z) - DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment [49.45064510462232]
GRPO-based approaches for text-to-image generation suffer from the sparse reward problem. We introduce DenseGRPO, a novel framework that aligns human preferences with dense rewards.
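To make the sparse-versus-dense contrast concrete, here is a hedged sketch of the two reward assignments over a denoising trajectory; the scoring function is a placeholder, not one of DenseGRPO's actual components.

```python
# Hedged sketch of sparse vs. dense per-step rewards; `score` stands in for
# any reward model applied to (decoded) intermediate states.
from typing import Callable, List

def sparse_rewards(num_steps: int, final_reward: float) -> List[float]:
    # Only the final sample is scored; intermediate steps carry no signal.
    return [0.0] * (num_steps - 1) + [final_reward]

def dense_rewards(states: List[float], score: Callable[[float], float]) -> List[float]:
    # Every intermediate state is scored, yielding a per-step training signal.
    return [score(x) for x in states]

print(sparse_rewards(4, 0.9))                       # [0.0, 0.0, 0.0, 0.9]
print(dense_rewards([0.2, 0.5, 0.8], lambda x: x))  # [0.2, 0.5, 0.8]
```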
arXiv Detail & Related papers (2026-01-28T03:39:05Z) - Parallel Diffusion Solver via Residual Dirichlet Policy Optimization [88.7827307535107]
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget. We propose the Ensemble Parallel Direction solver (dubbed EPD-EPr), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step.
arXiv Detail & Related papers (2025-12-28T05:48:55Z) - G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions. We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales. Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
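One possible reading of the Multi-Granularity Advantage Integration idea, sketched under the assumption of a simple weighted average over scales (the actual aggregation rule is specified in the paper):

```python
# Illustrative aggregation of group-relative advantages computed at several
# diffusion granularities; uniform weighting is an assumption for the sketch.
import numpy as np

def integrate_advantages(per_scale, weights=None):
    weights = weights if weights is not None else [1.0 / len(per_scale)] * len(per_scale)
    return sum(w * a for w, a in zip(weights, per_scale))

coarse = np.array([0.5, -0.5, 1.2, -1.2])  # advantages at a coarse scale
fine   = np.array([0.2, -0.1, 0.9, -1.0])  # advantages at a fine scale
print(integrate_advantages([coarse, fine]))
```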
arXiv Detail & Related papers (2025-10-02T12:57:12Z) - Aligning Few-Step Diffusion Models with Dense Reward Difference Learning [81.85515625591884]
Stepwise Diffusion Policy Optimization (SDPO) is an alignment method tailored for few-step diffusion models.
SDPO incorporates dense reward feedback at every intermediate step to ensure consistent alignment across all denoising steps.
SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations.
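In the spirit of dense reward difference learning, one could imagine a per-step preference signal like the sketch below, where each intermediate step of two roll-outs is scored and the difference drives a logistic preference loss; this is an assumption-laden illustration, not SDPO's actual objective.

```python
# Hypothetical per-step preference loss: score both roll-outs at every
# denoising step and penalize steps where the preferred one scores lower.
import numpy as np

def stepwise_preference_loss(r_win: np.ndarray, r_lose: np.ndarray) -> float:
    # r_win / r_lose: per-step rewards of the preferred / dispreferred roll-out.
    diffs = r_win - r_lose
    return float(-np.log(1.0 / (1.0 + np.exp(-diffs))).mean())

print(stepwise_preference_loss(np.array([0.8, 0.7, 0.9]),
                               np.array([0.5, 0.6, 0.4])))
```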
arXiv Detail & Related papers (2024-11-18T16:57:41Z) - Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control [26.195547996552406]
We cast reward fine-tuning as stochastic optimal control (SOC) for dynamical generative models that produce samples through an iterative process. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models.
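For orientation, the generic form of reward fine-tuning as a stochastic optimal control problem reads as follows; this standard formulation is given here as an assumption, and the paper's memoryless-schedule specifics are omitted.

```latex
% Generic SOC objective for reward fine-tuning: trade off control effort
% against the terminal reward r(X_1) under controlled dynamics.
\min_{u}\; \mathbb{E}\!\left[\int_0^1 \tfrac{1}{2}\,\lVert u_t(X_t)\rVert^2 \,\mathrm{d}t \;-\; r(X_1)\right],
\qquad
\mathrm{d}X_t = \bigl(b_t(X_t) + \sigma_t\, u_t(X_t)\bigr)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t .
```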
arXiv Detail & Related papers (2024-09-13T14:22:14Z) - Score-based Generative Models with Adaptive Momentum [40.84399531998246]
We propose an adaptive momentum sampling method to accelerate the transformation process.
We show that our method can produce more faithful images/graphs within a small number of sampling steps, with a 2 to 5 times speed-up.
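As a rough illustration of momentum-accelerated sampling (a plain heavy-ball form; the paper's adaptive rule is not reproduced here):

```python
# Heavy-ball momentum added to a generic iterative sampler update; `drift_fn`
# stands in for the score-based update direction. Names are illustrative.
import numpy as np

def momentum_sampler_step(x, v, drift_fn, step_size=0.01, beta=0.9):
    v = beta * v + drift_fn(x)   # accumulate past update directions
    x = x + step_size * v        # move along the smoothed direction
    return x, v

x, v = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(100):
    x, v = momentum_sampler_step(x, v, lambda z: -z)  # toy drift toward 0
print(x)
```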
arXiv Detail & Related papers (2024-05-22T15:20:27Z) - SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models [63.49229402384349]
Diffusion Probabilistic Models (DPMs) have achieved considerable success in generation tasks. As sampling from DPMs is equivalent to solving a diffusion SDE or ODE, which is time-consuming, numerous fast sampling methods built upon improved differential equation solvers have been proposed. We propose SA-Solver, an improved, efficient stochastic Adams method for solving diffusion SDEs to generate high-quality data.
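To show the multistep-solver idea in its simplest form, here is a generic two-step Adams-Bashforth update; it is deterministic and far simpler than SA-Solver's stochastic Adams scheme.

```python
# Generic AB2 step: reuse the previous drift evaluation to gain accuracy
# without extra function calls. Purely illustrative of the multistep idea.
def adams_bashforth2(x, f_curr, f_prev, h):
    # x_{n+1} = x_n + h * (3/2 * f(x_n) - 1/2 * f(x_{n-1}))
    return x + h * (1.5 * f_curr - 0.5 * f_prev)

# Toy usage on dx/dt = -x with step size h = 0.1.
x_prev, x = 1.0, 0.905
print(adams_bashforth2(x, -x, -x_prev, 0.1))  # ~0.819, close to exp(-0.2)
```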
arXiv Detail & Related papers (2023-09-10T12:44:54Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, along with variations of the method that yield favorable convergence.
We prove that, when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and this rate is accelerated by incorporating a scheduled restarting procedure.
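A minimal sketch of extragradient with iterate averaging on the bilinear game min_x max_y x^T A y; noiseless gradients and the absence of a restart schedule are both simplifying assumptions relative to the paper.

```python
# Extragradient with running iterate averaging on min_x max_y x^T A y.
import numpy as np

def eg_averaged(A, x, y, eta=0.1, iters=500):
    x_avg, y_avg = np.zeros_like(x), np.zeros_like(y)
    for t in range(iters):
        x_half = x - eta * (A @ y)      # extrapolation (lookahead) step
        y_half = y + eta * (A.T @ x)
        x = x - eta * (A @ y_half)      # update using lookahead gradients
        y = y + eta * (A.T @ x_half)
        x_avg += (x - x_avg) / (t + 1)  # running averages of the iterates
        y_avg += (y - y_avg) / (t + 1)
    return x_avg, y_avg

A = np.eye(2)
print(eg_averaged(A, np.array([1.0, -1.0]), np.array([0.5, 0.5])))
# The averaged iterates approach the Nash equilibrium (0, 0).
```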
arXiv Detail & Related papers (2021-06-30T17:51:36Z)