Related papers: DanceGRPO: Unleashing GRPO on Visual Generation

URL: http://arxiv.org/abs/2505.07818v4
Date: Thu, 28 Aug 2025 17:19:45 GMT
Title: DanceGRPO: Unleashing GRPO on Visual Generation
Authors: Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
Abstract summary: Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models. Existing methods like DDPO and DPOK face fundamental limitations when scaling to large and diverse prompt sets. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization.
Score: 42.567425922760144
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in generative AI have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. While Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models, existing methods like DDPO and DPOK face fundamental limitations, particularly their inability to maintain stable optimization when scaling to large and diverse prompt sets, severely restricting their practical utility. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (GRPO) for visual generation tasks. Our key insight is that GRPO's inherent stability mechanisms uniquely position it to overcome the optimization challenges that plague prior RL-based approaches on visual generation. DanceGRPO establishes several significant advances: First, it demonstrates consistent and stable policy optimization across multiple modern generative paradigms, including both diffusion models and rectified flows. Second, it maintains robust performance when scaling to complex, real-world scenarios encompassing three key tasks and four foundation models. Third, it shows remarkable versatility in optimizing for diverse human preferences as captured by five distinct reward models assessing image/video aesthetics, text-image alignment, video motion quality, and binary feedback. Our comprehensive experiments reveal that DanceGRPO outperforms baseline methods by up to 181% across multiple established benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.
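Illustrative note: the abstract does not spell out how GRPO scores samples, so the following is only a minimal sketch of the group-relative advantage as GRPO is commonly described: several outputs are sampled for the same prompt, each is scored by a reward model, and each reward is normalized by the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are assumptions made for illustration; this is not DanceGRPO's actual implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up reward-model scores for four images generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Samples that score above their group's mean receive a positive advantage and the rest a negative one, so no separate learned value function is needed to provide a baseline.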

Related papers

Unified Personalized Reward Model for Vision Generation [27.496220369122494]
We propose UnifiedReward-Flex, a unified personalized reward model for vision generation. We first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT. We then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment.
arXiv Detail & Related papers (2026-02-02T17:44:21Z)
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation [31.201343197395573]
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. We propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts.
arXiv Detail & Related papers (2026-01-05T16:36:40Z)
Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z)
DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO [50.89703227426486]
Reinforcement learning (RL) improves image generation quality significantly by comparing the relative performance of images generated within the same group. In the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity. This issue can be analyzed from both reward modeling and generation dynamics perspectives.
arXiv Detail & Related papers (2025-12-25T05:37:37Z)
Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models [31.470613363668672]
Adaptive Divergence Regularized Policy Optimization (ADRPO) automatically adjusts regularization strength based on advantage estimates. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models.
arXiv Detail & Related papers (2025-10-20T19:46:02Z)
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition [52.232968183793986]
General Policy Composition (GPC) is a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies. GPC consistently improves performance and adaptability across a diverse set of tasks.
arXiv Detail & Related papers (2025-10-01T16:05:53Z)
Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances [8.56304683490938]
Reinforcement learning offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks.
arXiv Detail & Related papers (2025-08-14T03:44:03Z)
Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling [80.30976039119236]
Lumina-mGPT 2.0 is a stand-alone, decoder-only autoregressive model. It is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models.
arXiv Detail & Related papers (2025-07-23T17:42:13Z)
SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution [55.14432034345353]
We study key design principles for latter cascaded video super-resolution models, which are currently underexplored. First, we propose two strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs.
arXiv Detail & Related papers (2025-06-24T17:57:26Z)
Seedance 1.0: Exploring the Boundaries of Video Generation Models [71.26796999246068]
Seedance 1.0 is a high-performance and inference-efficient video foundation generation model. It integrates multi-source data curation augmented with precise and meaningful video captioning. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20).
arXiv Detail & Related papers (2025-06-10T17:56:11Z)
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL [54.100889131719626]
Chain-of-thought reasoning and reinforcement learning have driven breakthroughs in NLP. We introduce ReasonGen-R1, a framework that imbues an autoregressive image generator with explicit text-based "thinking" skills. We show that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models.
arXiv Detail & Related papers (2025-05-30T17:59:48Z)
Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO [68.44918104224818]
Autoregressive image generation presents unique challenges distinct from Chain-of-Thought (CoT) reasoning. This study provides the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms.
arXiv Detail & Related papers (2025-05-22T17:59:49Z)
Learning Graph Representation of Agent Diffuser [9.402103660431793]
Diffusion-based generative models have advanced text-to-image synthesis. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR-AD, a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks.
arXiv Detail & Related papers (2025-05-10T21:42:24Z)
Flow-GRPO: Training Flow Matching Models via Online RL [75.70017261794422]
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number.
arXiv Detail & Related papers (2025-05-08T17:58:45Z)
HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment [13.320911720001277]
We introduce the strategy of Direct Preference Optimization (DPO) into text-to-video (T2V) tasks. Existing T2V generation methods lack a well-formed pipeline with exact loss function to guide the alignment of generated videos with human preferences.
arXiv Detail & Related papers (2025-02-02T16:55:42Z)
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
Improving Video Generation with Human Feedback [81.48120703718774]
Video generation has achieved significant advances, but issues like unsmooth motion and misalignment between videos and prompts persist. We develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. We introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy.
arXiv Detail & Related papers (2025-01-23T18:55:41Z)
Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
CAR: Controllable Autoregressive Modeling for Visual Generation [100.33455832783416]
Controllable AutoRegressive Modeling (CAR) is a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling. CAR progressively refines and captures control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process. Our approach demonstrates excellent controllability across various types of conditions and delivers higher image quality compared to previous methods.
arXiv Detail & Related papers (2024-10-07T00:55:42Z)
Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation [39.55651747758391]
We propose to use Reinforced Adversarial Learning (RAL) based on policy gradient optimization for autoregressive models. RAL also empowers the collaboration between different modules of the VQ-VAE framework. The proposed method achieves state-of-the-art results on CelebA for 64 × 64 image resolution.
arXiv Detail & Related papers (2020-07-20T08:10:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.