Fugu-MT Paper Translation (Summary): Group Critical-token Policy Optimization for Autoregressive Image Generation

Paper overview: Group Critical-token Policy Optimization for Autoregressive Image Generation

arxiv url: http://arxiv.org/abs/2509.22485v1
Date: Fri, 26 Sep 2025 15:33:18 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.551255
Title: Group Critical-token Policy Optimization for Autoregressive Image Generation
Authors: Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, Feng Zhao,
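Abstract summary: The key obstacle lies in how to identify the more critical image tokens during AR generation and implement effective token-wise optimization for them. Specifically, critical tokens in RLVR-based AR generation are identified from three perspectives: $\textbf{(1)}$ Causal dependency: early tokens fundamentally determine later tokens and the final image effect due to unidirectional dependency. Experiments on multiple text-to-image benchmarks for AR models and unified multimodal models demonstrate its effectiveness.
Reference score (independently computed attention score): 32.472222192052044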
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens for RLVR's training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose $\textbf{G}$roup $\textbf{C}$ritical-token $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{GCPO}$), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: $\textbf{(1)}$ Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; $\textbf{(2)}$ Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions; $\textbf{(3)}$ RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
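The abstract describes GCPO only at a high level. As a rough illustration, the sketch below shows one way its pieces could fit into a GRPO-style update: group-relative advantages, a mask that keeps about 30% of image tokens, and a per-token weight. The scoring function `hypothetical_criticality`, the equal weighting of the three cues, and the weight formula in `token_weights` are assumptions made for illustration only; they are not the paper's definitions.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each sample's reward within its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def hypothetical_criticality(entropy_grad: Sequence[float],
                             dissimilarity: Sequence[float]) -> List[float]:
    """Hypothetical equal-weight mix of the three cues named in the abstract.

    - causal prior: earlier tokens (in generation order) score higher
    - entropy_grad: per-token entropy-gradient magnitude, assumed scaled to [0, 1]
    - dissimilarity: 1 - visual similarity across the sampled group, assumed in [0, 1]
    """
    n = len(entropy_grad)
    return [((1.0 - i / n) + entropy_grad[i] + dissimilarity[i]) / 3.0 for i in range(n)]


def critical_token_mask(scores: Sequence[float], keep_ratio: float = 0.3) -> List[bool]:
    """Mark the top `keep_ratio` fraction of tokens (at least one) as critical."""
    k = max(1, round(keep_ratio * len(scores)))
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [i in top for i in range(len(scores))]


def token_weights(policy_logp: Sequence[float], ref_logp: Sequence[float],
                  max_weight: float = 2.0) -> List[float]:
    """Hypothetical weight: 1 + |log p_policy - log p_ref|, capped at `max_weight`."""
    return [min(max_weight, 1.0 + abs(p - q)) for p, q in zip(policy_logp, ref_logp)]


def token_advantages(seq_advantage: float, scores: Sequence[float],
                     policy_logp: Sequence[float], ref_logp: Sequence[float],
                     keep_ratio: float = 0.3) -> List[float]:
    """Give the sequence advantage only to critical tokens, scaled by their weight."""
    mask = critical_token_mask(scores, keep_ratio)
    weights = token_weights(policy_logp, ref_logp)
    return [seq_advantage * w if m else 0.0 for m, w in zip(mask, weights)]


if __name__ == "__main__":
    adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
    scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.05, 0.4, 0.6, 0.0]
    print(token_advantages(adv[0], scores, [-1.0] * 10, [-1.5] * 10))
```

In this sketch, non-critical tokens receive zero advantage, so only about 30% of positions contribute to the policy-gradient term, mirroring the abstract's statement that GCPO leverages 30% of image tokens. The abstract does not say whether the remaining tokens are dropped entirely or merely down-weighted.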


Related Papers