Flow-GRPO: Training Flow Matching Models via Online RL
- URL: http://arxiv.org/abs/2505.05470v3
- Date: Wed, 04 Jun 2025 13:31:15 GMT
- Title: Flow-GRPO: Training Flow Matching Models via Online RL
- Authors: Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
- Abstract summary: We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number.
- Score: 75.70017261794422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
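The ODE-to-SDE conversion in strategy (1) is an instance of a standard marginal-preserving construction. A minimal sketch, assuming the rectified-flow convention used by SD3 ($x_t = (1-t)x_0 + t\epsilon$, with sampling running from $t=1$, pure noise, to $t=0$, data); the paper's exact noise schedule $\sigma_t$ and sign conventions may differ:

```latex
% Deterministic probability-flow ODE of the pretrained model:
%   dx_t = v_\theta(x_t, t) \, dt .
% For any noise schedule \sigma_t, the reverse-time SDE
\[
  dx_t = \Big[\, v_\theta(x_t, t) - \tfrac{\sigma_t^2}{2}\, \nabla_x \log p_t(x_t) \Big]\, dt
         + \sigma_t \, d\bar{w}_t
\]
% shares the ODE's marginals p_t at every t (both drifts satisfy the same
% Fokker--Planck equation), which is what makes stochastic sampling for RL
% exploration possible without changing the modeled distribution. Under the
% linear interpolation above, the score is recoverable from the velocity:
\[
  \nabla_x \log p_t(x_t) \approx -\,\frac{x_t + (1 - t)\, v_\theta(x_t, t)}{t}
\]
% (exact when v_\theta equals the true posterior-mean velocity), so the SDE
% drift requires no network beyond v_\theta.
```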
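The "GRPO" in the name refers to GRPO-style policy optimization: each prompt receives a group of samples, and advantages are computed relative to the group rather than via a learned value function. A minimal, illustrative Python sketch of that group-relative advantage (the function name is hypothetical; the full GRPO objective additionally uses a clipped importance ratio and a KL penalty):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each sample's reward against the other samples drawn for
    the same prompt; `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 images sampled per prompt via the SDE above.
rewards = torch.tensor([[0.20, 0.90, 0.50, 0.40],
                        [0.70, 0.10, 0.30, 0.90]])
adv = group_relative_advantages(rewards)  # zero mean within each group
```

Denoising Reduction then collects these training rollouts with far fewer denoising steps than are used at inference, which is where the reported sampling-efficiency gain comes from.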
Related papers
- Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing [10.542645300983878]
We propose an inference-time scaling approach for pretrained flow models. We show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves particle sampling methods for inference-time scaling in flow models.
arXiv Detail & Related papers (2025-03-25T06:30:45Z)
- ROCM: RLHF on consistency models [8.905375742101707]
We propose a reward optimization framework for applying RLHF to consistency models. We investigate various $f$-divergences as regularization strategies, striking a balance between reward and model consistency.
arXiv Detail & Related papers (2025-03-08T11:19:48Z)
- Distributionally Robust Reinforcement Learning with Human Feedback [13.509499718691016]
We introduce a distributionally robust RLHF approach for fine-tuning large language models. Our goal is to ensure that a fine-tuned model retains its performance even when the distribution of prompts significantly differs. We show that our robust training improves the accuracy of the learned reward models on average, and markedly on some tasks, such as reasoning.
arXiv Detail & Related papers (2025-03-01T15:43:39Z)
- Provably Efficient Online RLHF with One-Pass Reward Modeling [59.30310692855397]
We propose a one-pass reward modeling method that does not require storing historical data and can be computed in constant time. We provide theoretical guarantees showing that our method improves both statistical and computational efficiency. We conduct experiments using Llama-3-8B-Instruct and Qwen2.5-7B-Instruct models on the Ultrafeedback-binarized and Mixture2 datasets.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
- Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization [14.320131946691268]
We propose an easy-to-use and theoretically sound fine-tuning method for flow-based generative models. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. Our method achieves optimal policy convergence while allowing controllable trade-offs between reward and diversity.
arXiv Detail & Related papers (2025-02-09T22:45:15Z)
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks. We introduce three instances of MARS that leverage preconditioned gradient optimization. Results indicate that MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- FlowTS: Time Series Generation via Rectified Flow [67.41208519939626]
FlowTS is an ODE-based model that leverages rectified flow with straight-line transport in probability space. In the unconditional setting, FlowTS achieves state-of-the-art performance, with context FID scores of 0.019 and 0.011 on the Stock and ETTh datasets. In the conditional setting, it achieves superior performance on solar forecasting.
arXiv Detail & Related papers (2024-11-12T03:03:23Z)
- UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation [93.38604803625294]
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG).
We use Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks.
UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-10-03T17:39:38Z)
- A Distribution-Aware Flow-Matching for Generating Unstructured Data for Few-Shot Reinforcement Learning [1.0709300917082865]
We introduce a distribution-aware flow matching approach to generate synthetic unstructured data for few-shot reinforcement learning. Our approach addresses key challenges in traditional model-based RL, such as overfitting and data correlation. Results demonstrate that our method achieves stable convergence in terms of maximum Q-value while enhancing frame rates by 30% in the initial timesteps.
arXiv Detail & Related papers (2024-09-21T15:50:59Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\!\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Faster Adaptive Federated Learning [84.38913517122619]
Federated learning has attracted increasing attention with the emergence of distributed data.
In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on a momentum-based variance reduction technique in cross-silo FL.
arXiv Detail & Related papers (2022-12-02T05:07:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.