V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
Abstract Overview
This paper proposes Variational GRPO (V-GRPO), a method that integrates ELBO-based likelihood surrogates into the Group Relative Policy Optimization (GRPO) algorithm for online reinforcement learning of denoising generative models. The authors address the known instability of ELBO-based surrogates in visual generation by introducing variance-reduction techniques (group-shared timestep-noise pairs, stratified timestep sampling, adaptive loss weighting) and gradient-control strategies (importance-ratio clipping, KL penalty, advantage soft-clipping). The method treats multi-step generation as an atomic action, avoiding the MDP formulation over sampling trajectories used by prior approaches. Experiments on FLUX.1-dev and SD 3.5 M demonstrate that V-GRPO achieves state-of-the-art or comparable text-to-image alignment while delivering significant training speedups over MDP-based baselines.
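Two of the variance-reduction techniques named above, stratified timestep sampling and group-shared timestep-noise pairs, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the timestep range [0, 1), the equal-width bins, and the toy latent dimension.

```python
import random

def stratified_timesteps(num_samples, rng):
    """Draw one timestep from each of num_samples equal-width bins of [0, 1).

    Assumption: timesteps live in [0, 1) and bins are equal width.
    """
    return [(k + rng.random()) / num_samples for k in range(num_samples)]

def shared_timestep_noise_pairs(num_pairs, latent_dim, seed=0):
    """Build (timestep, noise) pairs once, to be reused by a whole group."""
    rng = random.Random(seed)
    timesteps = stratified_timesteps(num_pairs, rng)
    return [(t, [rng.gauss(0.0, 1.0) for _ in range(latent_dim)])
            for t in timesteps]

# Hypothetical sizes, chosen only for the example.
group_size = 8
pairs = shared_timestep_noise_pairs(num_pairs=4, latent_dim=3)

# Every sample in the group is scored with the same (timestep, noise) pairs,
# so the randomness of the ELBO estimate is common to all group members.
per_sample_pairs = [pairs for _ in range(group_size)]
```

Sharing the pairs means the ELBO estimates of different samples in a group are affected by the same draws, which is the intuition behind using them for group-relative comparisons; stratification guarantees that each region of the timestep range is covered exactly once per draw.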
Novelty
The main novelty is demonstrating that ELBO-based surrogates, previously reported to underperform in visual generation, can match or outperform MDP-based online RL approaches when stabilized with a specific set of variance-reduction and gradient-control techniques. The work is also distinctive in treating generation as an atomic action within GRPO rather than decomposing it into sequential MDP transitions, which decouples optimization from the sampling process and permits higher-order ODE solvers during rollout.
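To make the atomic-action view concrete, the sketch below scores each generated sample with a single scalar (an ELBO estimate standing in for its log-likelihood) and plugs it into a PPO/GRPO-style clipped objective. This is a generic illustration under stated assumptions, not the paper's exact objective: the standard-deviation normalization of advantages, the ratio form `exp(elbo_new - elbo_old)`, the clip range of 0.2, and all function names are assumptions.

```python
import math
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (same prompt) to zero mean.

    Assumption: advantages are divided by the group standard deviation,
    as in the commonly used GRPO formulation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(elbo_new, elbo_old, advantages, clip_eps=0.2):
    """Clipped objective with one importance ratio per generated sample.

    Each sample is a single (atomic) action, so there is one ratio per
    sample rather than one per denoising step.
    """
    total = 0.0
    for new, old, adv in zip(elbo_new, elbo_old, advantages):
        ratio = math.exp(new - old)  # ELBO difference as log-ratio surrogate
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Toy group of four samples with made-up rewards.
advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
objective = clipped_surrogate([0.0] * 4, [0.0] * 4, advantages)
```

Because the ratio depends only on ELBO estimates of the final sample, nothing in this objective refers to the sampler's intermediate steps, which is why the rollout solver can be chosen independently of the optimization.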
Results
On FLUX.1-dev, V-GRPO outperforms all compared baselines (including MixGRPO and BranchGRPO) on HPSv2.1, PickScore, ImageReward, and UnifiedReward at 300 iterations, and reaches reward comparable to MixGRPO's in half the iterations (150 vs. 300). On SD 3.5 M, V-GRPO matches DiffusionNFT on GenEval and OCR while matching or improving on the model-based metrics (CLIPScore, HPSv2.1, Aesthetics), using roughly 3× fewer gradient steps (580 vs. 1.7K) and substantially lower function-evaluation cost. Ablations confirm that the proposed variance-reduction techniques are collectively essential for stability and that different gradient-control strategies suit different training regimes.
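The step-count ratios quoted above can be checked with a few lines of arithmetic. The only assumption is that "1.7K" is read as 1700, which is a rounded figure, so the second ratio is approximate.

```python
# Step counts as quoted in the summary; 1.7K is taken as 1700 (assumption).
mixgrpo_iters, vgrpo_flux_iters = 300, 150
diffusionnft_steps, vgrpo_sd_steps = 1700, 580

iteration_ratio = mixgrpo_iters / vgrpo_flux_iters   # FLUX.1-dev comparison
step_ratio = diffusionnft_steps / vgrpo_sd_steps     # SD 3.5 M comparison

print(f"{iteration_ratio:.2f}x fewer iterations than MixGRPO")
print(f"{step_ratio:.2f}x fewer gradient steps than DiffusionNFT")
```

These ratios count iterations and gradient steps only; they say nothing about wall-clock time, which also depends on per-step cost.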
Key Points
- V-GRPO replaces the MDP-based trajectory optimization used in prior work with an ELBO-based surrogate for model log-likelihood within GRPO, treating multi-step generation as an atomic action and decoupling optimization from the sampling process.
- The method's practical contribution is a stabilization recipe comprising group-shared timestep-noise pairs, stratified timestep sampling, and adaptive loss weighting for variance reduction, combined with selective use of importance-ratio clipping, KL penalty, or advantage soft-clipping depending on the training regime.
- Empirically, V-GRPO achieves the best reported multi-reward results on FLUX.1-dev among compared methods and matches DiffusionNFT on SD 3.5 M, while needing about half the training iterations of MixGRPO (150 vs. 300) and roughly one-third the gradient steps of DiffusionNFT (580 vs. 1.7K).