DanceGRPO: Unleashing GRPO on Visual Generation
- URL: http://arxiv.org/abs/2505.07818v1
- Date: Mon, 12 May 2025 17:59:34 GMT
- Title: DanceGRPO: Unleashing GRPO on Visual Generation
- Authors: Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
- Abstract summary: This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization to visual generation paradigms. We show consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equation (ODE)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only stabilizes policy optimization for complex video generation, but also enables the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
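To make the group-relative objective concrete, below is a minimal sketch of a GRPO-style update over sampled denoising trajectories. The function names, tensor shapes, and omissions (no PPO-style clipping or KL regularization) are our own illustrative assumptions, not the authors' released code: each prompt yields a group of samples, a reward model such as HPS-v2.1 scores them, and advantages come from standardizing rewards within each group, so no learned value critic is needed.

```python
# Hypothetical GRPO-style objective for a denoising policy (illustrative
# sketch; clipping and KL regularization are omitted for brevity).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within each group.

    rewards: (num_groups, group_size) raw scores from a reward model.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss over denoising steps.

    log_probs:  (num_groups, group_size, num_steps) log-likelihoods of the
                sampled transitions under the current policy.
    advantages: (num_groups, group_size) from grpo_advantages.
    """
    # Each trajectory's advantage is broadcast over its denoising steps.
    return -(advantages.unsqueeze(-1) * log_probs).mean()

# Toy usage: 2 prompts, groups of 4 samples, 10 denoising steps.
rewards = torch.randn(2, 4)                           # e.g. reward-model scores
log_probs = torch.randn(2, 4, 10, requires_grad=True)
loss = grpo_loss(log_probs, grpo_advantages(rewards))
loss.backward()
```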
Related papers
- Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling [80.30976039119236]
Lumina-mGPT 2.0 is a stand-alone, decoder-only autoregressive model. It is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models.
arXiv Detail & Related papers (2025-07-23T17:42:13Z)
- SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution [55.14432034345353]
We study key design principles for the latter cascaded video super-resolution models, which are currently underexplored. First, we propose two strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs.
arXiv Detail & Related papers (2025-06-24T17:57:26Z)
- Seedance 1.0: Exploring the Boundaries of Video Generation Models [71.26796999246068]
Seedance 1.0 is a high-performance and inference-efficient video foundation generation model. It integrates multi-source curated data with precise and meaningful video captioning. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA L20).
arXiv Detail & Related papers (2025-06-10T17:56:11Z)
- ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL [54.100889131719626]
Chain-of-thought reasoning and reinforcement learning have driven breakthroughs in NLP. We introduce ReasonGen-R1, a framework that imbues an autoregressive image generator with explicit text-based "thinking" skills. We show that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models.
arXiv Detail & Related papers (2025-05-30T17:59:48Z)
- Learning Graph Representation of Agent Diffuser [9.402103660431793]
Diffusion-based generative models have advanced text-to-image synthesis. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR-AD, a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks.
arXiv Detail & Related papers (2025-05-10T21:42:24Z)
- Flow-GRPO: Training Flow Matching Models via Online RL [75.70017261794422]
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference timesteps.
arXiv Detail & Related papers (2025-05-08T17:58:45Z)
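The ODE-to-SDE conversion in (1) can be illustrated with one Euler-Maruyama sampling step. This is a hedged sketch under our own conventions (rectified-flow parameterization x_t = (1 - t) x_0 + t * eps with velocity v = eps - x_0; time, sign, and noise-schedule conventions vary across papers), not the Flow-GRPO implementation; setting sigma to zero recovers the deterministic ODE step.

```python
# Illustrative stochastic sampling step that preserves the ODE's marginals,
# giving RL the per-step randomness it needs to explore.
import torch

def sde_step(x: torch.Tensor, v: torch.Tensor, t: float, dt: float,
             sigma: float) -> torch.Tensor:
    """One Euler-Maruyama step from time t to t + dt (dt < 0, toward data).

    x:     current latent x_t
    v:     velocity prediction v_theta(x_t, t)
    sigma: injected noise scale; sigma = 0 gives the deterministic ODE step
    """
    # For this Gaussian path the score can be written via the velocity:
    # grad log p_t(x) = -(x + (1 - t) * v) / t
    score = -(x + (1.0 - t) * v) / t
    # Reverse-time SDE drift sharing the probability-flow ODE's marginals.
    drift = v - 0.5 * sigma ** 2 * score
    noise = sigma * torch.randn_like(x) * abs(dt) ** 0.5
    return x + drift * dt + noise

# Toy usage: one step at t = 0.5, stepping toward the data (dt < 0).
x_t = torch.randn(1, 4, 8, 8)
v_hat = torch.randn_like(x_t)   # stand-in for the model's velocity prediction
x_next = sde_step(x_t, v_hat, t=0.5, dt=-0.05, sigma=0.3)
```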
- HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment [13.320911720001277]
We introduce the strategy of Direct Preference Optimization (DPO) into text-to-video (T2V) tasks. Existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences.
arXiv Detail & Related papers (2025-02-02T16:55:42Z)
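For reference, the standard DPO objective that such pipelines build on can be sketched as below. In a T2V setting the inputs would be trajectory log-probabilities of preferred versus rejected videos under the trained policy and a frozen reference model; the names here are our own, and HuViDPO's exact loss may differ.

```python
# Standard DPO loss on (winner, loser) preference pairs (illustrative).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win: torch.Tensor, policy_logp_lose: torch.Tensor,
             ref_logp_win: torch.Tensor, ref_logp_lose: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margin: how much more the policy prefers the winner
    # than the frozen reference model does.
    logits = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    return -F.logsigmoid(logits).mean()

# Toy usage with scalar sequence log-likelihoods.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
```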
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- Improving Video Generation with Human Feedback [81.48120703718774]
Video generation has achieved significant advances, but issues like unsmooth motion and misalignment between videos and prompts persist. We develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. We introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy.
arXiv Detail & Related papers (2025-01-23T18:55:41Z)
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
- CAR: Controllable Autoregressive Modeling for Visual Generation [100.33455832783416]
Controllable AutoRegressive Modeling (CAR) is a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling.
CAR progressively refines and captures control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process.
Our approach demonstrates excellent controllability across various types of conditions and delivers higher image quality compared to previous methods.
arXiv Detail & Related papers (2024-10-07T00:55:42Z)
- Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation [39.55651747758391]
We propose to use Reinforced Adversarial Learning (RAL) based on policy gradient optimization for autoregressive models.
RAL also empowers the collaboration between different modules of the VQ-VAE framework.
The proposed method achieves state-of-the-art results on CelebA at $64 \times 64$ image resolution.
arXiv Detail & Related papers (2020-07-20T08:10:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.