Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
- URL: http://arxiv.org/abs/2508.20751v1
- Date: Thu, 28 Aug 2025 13:11:24 GMT
- Title: Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
- Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
- Abstract summary: Current methods for scoring generated images are susceptible to reward hacking. We propose Pref-GRPO, which shifts the optimization objective from score to preference fitting, ensuring more stable training. Existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. We introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes.
- Score: 34.75717081153747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our benchmark uncovers the strengths and weaknesses of both open and closed-source T2I models and validates the effectiveness of Pref-GRPO.
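To make the failure mode and the proposed fix concrete, here is a minimal NumPy sketch (not the authors' implementation): it shows how group normalization can inflate near-identical pointwise scores into large advantages, and how a win-rate reward computed from pairwise comparisons, as the abstract describes, replaces the raw score. The `preference` function and the group size of 8 are hypothetical stand-ins, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A group of 8 rollouts for one prompt whose pointwise RM scores differ
# only by noise (~1e-3), i.e. the images are essentially tied in quality.
scores = 0.85 + 1e-3 * rng.standard_normal(8)

# GRPO-style group normalization amplifies these trivial gaps into
# advantages of magnitude ~1: the "illusory advantage" the paper describes.
adv_pointwise = (scores - scores.mean()) / (scores.std() + 1e-8)
print("pointwise advantages:", np.round(adv_pointwise, 2))

def preference(a, b):
    """Hypothetical stand-in for a pairwise preference RM: returns 1.0 if
    image a is preferred over image b, else 0.0. A real preference RM
    would compare the two images conditioned on the prompt."""
    return 1.0 if a > b else 0.0

def win_rate_rewards(group):
    """Pref-GRPO-style reward: each image's fraction of pairwise wins
    against the other images in its group."""
    n = len(group)
    wins = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                wins[i] += preference(group[i], group[j])
    return wins / (n - 1)

rewards = win_rate_rewards(scores)  # bounded in [0, 1]
adv_pairwise = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print("win-rate rewards:   ", np.round(rewards, 2))
print("pairwise advantages:", np.round(adv_pairwise, 2))
```

The sketch only illustrates the reward computation; in the paper the win rate then feeds into the usual GRPO policy update, and the claimed stability comes from fitting preferences rather than maximizing an unbounded score.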
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment [28.18756041538092]
We present TAGRPO, a robust framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization.
arXiv Detail & Related papers (2026-01-09T11:15:27Z)
- Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates. It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
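The summary is terse, so the following is only a speculative sketch of the general idea of densifying a terminal reward over a sampling trajectory; the backward-discounting scheme and the function name are assumptions, not VGPO's actual formulation.

```python
import numpy as np

def densify_terminal_reward(terminal_reward: float, num_steps: int,
                            gamma: float = 0.95) -> np.ndarray:
    """Spread a single end-of-trajectory reward into per-step value
    estimates by discounting backwards from the final step.
    Generic illustration only, not VGPO's exact scheme."""
    steps = np.arange(num_steps)
    return terminal_reward * gamma ** (num_steps - 1 - steps)

print(densify_terminal_reward(1.0, 5))  # earlier steps receive smaller credit
```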
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
- Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation [18.178060190776858]
We introduce BPGO, a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
arXiv Detail & Related papers (2025-11-24T09:29:30Z)
- Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation [29.015994347609936]
Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation. We argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate the issues of step-level optimization. Chunk-GRPO is the first chunk-level GRPO-based approach for T2I generation.
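As a rough illustration of the step-to-chunk shift (our assumption about the mechanics, since the summary gives no details), the sketch below groups per-step importance log-ratios into contiguous chunks and applies one advantage-weighted term per chunk rather than per step.

```python
import numpy as np

def chunked_objective(step_log_ratios: np.ndarray, advantage: float,
                      chunk_size: int) -> float:
    """Aggregate per-step log-ratios into chunks, then weight each
    chunk-level ratio by the shared group advantage. Illustrative only."""
    n = len(step_log_ratios)
    total = 0.0
    for start in range(0, n, chunk_size):
        chunk_ratio = np.exp(step_log_ratios[start:start + chunk_size].sum())
        total += chunk_ratio * advantage
    return float(total / np.ceil(n / chunk_size))

log_ratios = np.zeros(12)  # e.g. 12 sampling steps, on-policy (ratios = 1)
print(chunked_objective(log_ratios, advantage=0.7, chunk_size=4))  # -> 0.7
```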
arXiv Detail & Related papers (2025-10-24T15:50:36Z)
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
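A minimal sketch of token weighting by informational density, assuming predictive entropy as the density proxy (the paper defines its own weighting; the entropy choice here is ours):

```python
import numpy as np

def token_weighted_loss(per_token_loss: np.ndarray,
                        token_entropy: np.ndarray) -> float:
    """Weight each token's policy loss by its normalized predictive entropy,
    so high-information tokens dominate the update. Illustrative only."""
    weights = token_entropy / (token_entropy.sum() + 1e-8)
    return float((weights * per_token_loss).sum())

loss = np.array([0.2, 1.5, 0.1, 0.9])      # per-token losses
entropy = np.array([0.1, 2.0, 0.05, 1.2])  # high entropy = more informative
print(token_weighted_loss(loss, entropy))
```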
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
- GPT Meets Graphs and KAN Splines: Testing Novel Frameworks on Multitask Fine-Tuned GPT-2 with LoRA [0.0]
We explore the potential of integrating learnable and interpretable modules--specifically Kolmogorov-Arnold Networks (KAN) and graph-based representations--within a pre-trained GPT-2 model.
arXiv Detail & Related papers (2025-03-25T19:58:25Z)
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation [40.74782694945025]
We propose Parrot, which addresses the issue of manually adjusting reward weights. We use a novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network. We also introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. A generic sketch of Pareto selection over multiple rewards follows.
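To give a flavor of Pareto-optimal selection over multiple rewards (a generic illustration; Parrot's actual batch-wise algorithm may differ), the sketch below keeps only rollouts whose reward vectors are not dominated by any other rollout in the batch.

```python
import numpy as np

def pareto_front(reward_matrix: np.ndarray) -> np.ndarray:
    """Return indices of non-dominated rows. Row i is dominated if some
    row j is >= on every reward and > on at least one."""
    n = reward_matrix.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            np.all(reward_matrix[j] >= reward_matrix[i])
            and np.any(reward_matrix[j] > reward_matrix[i])
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return np.array(keep)

# Columns are hypothetical reward axes, e.g. aesthetics, alignment, preference.
R = np.array([[0.9, 0.2, 0.5],
              [0.8, 0.9, 0.6],
              [0.7, 0.8, 0.5]])   # last row is dominated by the second
print(pareto_front(R))            # -> [0 1]
```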
arXiv Detail & Related papers (2024-01-11T05:36:36Z)