2026-04-25 Daily Report: V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

Authors Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy

Affiliations Stanford University / Tsinghua University / Amazon / University of Pennsylvania

Categories Method / Policy Optimization / Group relative policy optimization, Application / Image Synthesis / Text-to-image synthesis performance, Evaluation / Model Efficiency / Speedup comparison with baseline methods

License CC BY 4.0

Abstract Overview

This paper proposes Variational GRPO (V-GRPO), a method that integrates ELBO-based likelihood surrogates into the Group Relative Policy Optimization (GRPO) algorithm for online reinforcement learning of denoising generative models. The authors address the known instability of ELBO-based surrogates in visual generation by introducing variance-reduction techniques (group-shared timestep-noise pairs, stratified timestep sampling, adaptive loss weighting) and gradient-control strategies (importance-ratio clipping, KL penalty, advantage soft-clipping). The method treats multi-step generation as an atomic action, avoiding the MDP formulation over sampling trajectories used by prior approaches. Experiments on FLUX.1-dev and SD 3.5 M demonstrate that V-GRPO achieves state-of-the-art or comparable text-to-image alignment while delivering significant training speedups over MDP-based baselines.
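As a rough illustration of the "atomic action" view, the sketch below computes group-relative advantages in the standard GRPO style: each prompt yields a group of finished images, each image receives one scalar reward, and rewards are normalized within the group. The function name `group_relative_advantages` and the `eps` value are assumptions made for this example, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one scalar reward per finished sample within its group.

    Hypothetical helper: each reward scores a whole generated image
    (the atomic action), not an individual denoising step.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps (an assumed constant) avoids division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Three images generated for the same prompt, scored by a reward model
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Because the advantage attaches to the final sample only, nothing in this step depends on how many solver steps produced the image.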

Novelty

The main novelty is demonstrating that ELBO-based surrogates, previously reported to underperform in visual generation, can match or outperform MDP-based online RL approaches when stabilized with a specific set of variance-reduction and gradient-control techniques. The work is also distinctive in treating generation as an atomic action within GRPO rather than decomposing it into sequential MDP transitions, which decouples optimization from the sampling process and permits higher-order ODE solvers during rollout.
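One way to picture an ELBO-based surrogate inside a clipped policy objective is sketched below. It assumes that, up to constants and weighting, the ELBO of a denoising model is a negative denoising loss, so the log-likelihood ratio between the current and old policy can be approximated by the difference of their denoising losses evaluated on the same timestep-noise pairs. The function names, the unweighted average, and `clip_eps=0.2` are illustrative assumptions; the paper's actual estimator and weighting may differ.

```python
import math

def elbo_log_ratio(losses_new, losses_old):
    """Approximate log p_new(x) - log p_old(x) for one sample x.

    losses_new / losses_old: denoising losses of the current and old model,
    evaluated on the SAME (timestep, noise) pairs (assumed inputs).
    A lower loss corresponds to a higher ELBO, hence old minus new.
    """
    assert len(losses_new) == len(losses_old)
    diffs = [l_old - l_new for l_new, l_old in zip(losses_new, losses_old)]
    return sum(diffs) / len(diffs)

def clipped_objective(log_ratio, advantage, clip_eps=0.2):
    """PPO/GRPO-style clipped surrogate for a single sample (to be maximized)."""
    ratio = math.exp(log_ratio)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

if __name__ == "__main__":
    lr = elbo_log_ratio([0.40, 0.55], [0.50, 0.60])
    print(lr, clipped_objective(lr, advantage=1.0))
```

Since the surrogate only needs denoising losses on final samples, the rollout itself is free to use any sampler, which is consistent with the decoupling described above.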

Results

On FLUX.1-dev, V-GRPO outperforms all compared baselines (including MixGRPO and BranchGRPO) on HPSv2.1, PickScore, ImageReward, and UnifiedReward at 300 iterations, and reaches reward comparable to MixGRPO's 300-iteration result in only 150 iterations. On SD 3.5 M, V-GRPO matches DiffusionNFT on GenEval and OCR while matching or improving on the model-based metrics (CLIPScore, HPSv2.1, Aesthetics), with roughly 3× fewer gradient steps (580 vs. 1.7K) and substantially lower function-evaluation cost. Ablations indicate that the proposed variance-reduction techniques are collectively essential for stability and that different gradient-control strategies suit different training regimes.

Key Points

V-GRPO replaces the MDP-based trajectory optimization used in prior work with an ELBO-based surrogate for model log-likelihood within GRPO, treating multi-step generation as an atomic action and decoupling optimization from the sampling process.
The method's practical contribution is a stabilization recipe comprising group-shared timestep-noise pairs, stratified timestep sampling, and adaptive loss weighting for variance reduction, combined with selective use of importance-ratio clipping, KL penalty, or advantage soft-clipping depending on the training regime.
Empirically, V-GRPO achieves the best reported multi-reward results on FLUX.1-dev among the compared methods and matches DiffusionNFT on SD 3.5 M, while needing about 2× fewer training iterations than MixGRPO (150 vs. 300) and roughly 3× fewer gradient steps than DiffusionNFT (580 vs. 1.7K).
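Two of the variance-reduction ideas named above, stratified timestep sampling and group-shared timestep-noise pairs, can be sketched as follows. The sketch assumes continuous timesteps in [0, 1) and represents each noise draw by an integer seed; the helper names and the seed-based representation are assumptions for illustration, not the authors' implementation.

```python
import random

def stratified_timesteps(k, rng):
    """Draw one timestep from each of k equal-width strata of [0, 1)."""
    return [(i + rng.random()) / k for i in range(k)]

def group_shared_pairs(k, rng):
    """Build k (timestep, noise_seed) pairs to be reused by every sample in a group."""
    return [(t, rng.getrandbits(32)) for t in stratified_timesteps(k, rng)]

def noise_from_seed(seed, dim):
    """Regenerate the same Gaussian noise vector from a shared seed."""
    local = random.Random(seed)
    return [local.gauss(0.0, 1.0) for _ in range(dim)]

if __name__ == "__main__":
    rng = random.Random(0)
    pairs = group_shared_pairs(4, rng)
    for t, seed in pairs:
        # Every sample in the group would be noised with this same (t, noise)
        print(round(t, 3), noise_from_seed(seed, dim=2))
```

Stratification spreads the draws evenly over the timestep range, and sharing the pairs within a group means the loss differences between samples are not dominated by differences in which timestep or noise each sample happened to receive.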

References

arXiv: https://arxiv.org/abs/2604.23380v1
Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.23380v1
GitHub: https://github.com/tang-bd/v-grpo
