Fugu-MT Paper Translation (Abstract): Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Paper abstract: Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

arxiv url: http://arxiv.org/abs/2605.29198v1
Date: Thu, 28 May 2026 00:17:49 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.571254
Title: Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization
Title (reference translation): Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
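Abstract summary: Group-advantage-based reinforcement learning methods such as GRPO and DAPO have demonstrated strong performance across diverse domains. We propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks.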
Reference score (independently computed attention score): 38.9467847203731
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
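Abstract (reference translation): Group-advantage-based reinforcement learning methods such as GRPO and DAPO have shown strong performance in a variety of domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards brings an important limitation, because assigning credit uniformly to all tokens fails to capture fine-grained, token-level contributions. To address this problem, we propose Guidance Contrastive Policy Optimization (GCPO), a new algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Instead of broadcasting sample-level advantages uniformly, GCPO assigns token-level advantages in proportion to the difference between these contrastive predictions, giving more precise and informative learning signals. Empirically, GCPO emphasizes semantically relevant regions, such as visual areas aligned with the text prompt in text-to-image generation and important keywords within the reasoning traces of chain-of-thought tasks. Across extensive experiments, GCPO consistently outperforms the GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.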
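
The abstract says token-level advantages are proportional to the difference between predictions under positive and negative prompts, but it does not give the formula. The following is only a minimal sketch of that idea under stated assumptions, not the paper's implementation: the function names, the use of per-token log-probabilities, the absolute-gap weighting, and the rescaling so that weights average to 1 are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style sample-level advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(sample_advantage, logp_pos, logp_neg, eps=1e-8):
    """Spread one sample-level advantage over the tokens of that sample.

    ASSUMPTION (not from the paper): the weight of token t is
    |logp_pos[t] - logp_neg[t]|, rescaled so the weights average to 1.
    With that rescaling the mean token advantage equals the sample advantage.
    """
    if len(logp_pos) != len(logp_neg):
        raise ValueError("logp_pos and logp_neg must have the same length")
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    total = sum(gaps)
    if total < eps:
        # No contrast between the two prompts: fall back to the uniform
        # broadcast that sample-level methods use.
        return [sample_advantage] * len(gaps)
    scale = len(gaps) / total
    return [sample_advantage * g * scale for g in gaps]


if __name__ == "__main__":
    # Hypothetical numbers: four sampled responses with binary rewards.
    advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
    # Hypothetical per-token log-probs of the first response's tokens
    # under a positive prompt and under a negative prompt.
    logp_pos = [-0.1, -0.2, -2.0]
    logp_neg = [-0.1, -1.2, -4.0]
    print(token_advantages(advantages[0], logp_pos, logp_neg))
```

In this sketch a token whose probability is unchanged by swapping the prompt receives zero advantage, while tokens that depend strongly on the prompt receive more than the uniform share. Whether GCPO uses an absolute or signed difference, or a different normalization, is not stated in the abstract.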

Related papers