Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation
- URL: http://arxiv.org/abs/2510.21583v1
- Date: Fri, 24 Oct 2025 15:50:36 GMT
- Title: Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation
- Authors: Yifu Luo, Penghui Du, Bo Li, Sinan Du, Tiantian Zhang, Yongzhe Chang, Kai Wu, Kun Gai, Xueqian Wang
- Abstract summary: Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation. We argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate its inaccurate advantage attribution and its neglect of the temporal dynamics of generation. Chunk-GRPO is the first chunk-level GRPO-based approach for T2I generation.
- Score: 29.015994347609936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation, but it faces two key limitations: inaccurate advantage attribution and neglect of the temporal dynamics of generation. In this work, we argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate these issues. Building on this idea, we propose Chunk-GRPO, the first chunk-level GRPO-based approach for T2I generation. The insight is to group consecutive steps into coherent 'chunks' that capture the intrinsic temporal dynamics of flow matching, and to optimize policies at the chunk level. In addition, we introduce an optional weighted sampling strategy to further enhance performance. Extensive experiments show that Chunk-GRPO achieves superior results in both preference alignment and image quality, highlighting the promise of chunk-level optimization for GRPO-based methods.
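The chunk-level idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the standard GRPO group-normalized advantage, a PPO-style clipped surrogate, and one importance ratio per chunk formed by summing per-step log-probability differences. The function names, the `chunk_sizes` argument, and the choice of summing (rather than averaging) within a chunk are all hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def chunk_log_ratios(new_logps, old_logps, chunk_sizes):
    """Sum per-step log-prob differences inside each chunk of consecutive steps.

    Assumption: a chunk's ratio is the product of its step ratios,
    i.e. the sum of the step log-ratios.
    """
    assert sum(chunk_sizes) == len(new_logps) == len(old_logps)
    ratios, start = [], 0
    for size in chunk_sizes:
        end = start + size
        ratios.append(
            sum(n - o for n, o in zip(new_logps[start:end], old_logps[start:end]))
        )
        start = end
    return ratios


def chunk_clipped_objective(new_logps, old_logps, chunk_sizes, advantage, clip=0.2):
    """Clipped surrogate with one importance ratio per chunk, averaged over chunks."""
    terms = []
    for log_ratio in chunk_log_ratios(new_logps, old_logps, chunk_sizes):
        ratio = math.exp(log_ratio)
        clipped = min(max(ratio, 1 - clip), 1 + clip)
        terms.append(min(ratio * advantage, clipped * advantage))
    return mean(terms)
```

With `chunk_sizes=[1, 1, ..., 1]` the sketch reduces to a step-level objective, which makes the contrast with chunk-level optimization easy to inspect on toy inputs.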
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks [23.119173310662365]
Group-based reinforcement learning (RL) has advanced the capabilities of large language models on long-horizon agentic tasks. We find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. We propose HGPO, which assigns each step to multiple hierarchical groups according to the consistency of historical contexts.
arXiv Detail & Related papers (2026-02-26T09:58:10Z)
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization [97.18886232580131]
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration. We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
arXiv Detail & Related papers (2026-01-23T06:21:33Z)
- Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models [37.48289959306949]
Group Relative Policy Optimization is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. We propose Pro-GRPO, a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process.
arXiv Detail & Related papers (2025-12-17T11:44:34Z)
- Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates. It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z)
- ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning [17.928214942495412]
ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase. We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro. Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
arXiv Detail & Related papers (2025-10-01T09:11:27Z)
- Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning [34.75717081153747]
Current methods for scoring generated images are susceptible to reward hacking. We propose Pref-GRPO, which shifts the optimization objective from score to preference fitting, ensuring more stable training. Existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. We introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes.
arXiv Detail & Related papers (2025-08-28T13:11:24Z)
- On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence [2.8165669455824696]
Group Relative Policy Optimization is a critic-free reinforcement learning algorithm. We show that the GRPO update rule estimates the policy gradient at the old policy rather than the current one. We propose a new algorithm: Trajectory-level Importance Corrected GRPO.
arXiv Detail & Related papers (2025-08-04T19:01:19Z)
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z)
- Group Relative Policy Optimization for Image Captioning [1.9606373630214207]
We propose using the latest Group Relative Policy Optimization (GRPO) reinforcement learning algorithm as an optimization solution for the second stage. By constraining the amplitude of policy updates and KL divergence, the stability of the model during training is greatly guaranteed.
arXiv Detail & Related papers (2025-03-03T09:16:41Z)
- Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [108.72225067368592]
We propose a novel perspective to investigate the design of large language models (LLMs)-based prompts. We identify two pivotal factors in model parameter learning: update direction and update method. We develop a capable Gradient-inspired Prompt-based GPO.
arXiv Detail & Related papers (2024-02-27T15:05:32Z)
- Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) framework. We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z)