Fugu-MT paper translation (abstract): STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Paper overview: STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

arxiv url: http://arxiv.org/abs/2606.17979v2
Date: Thu, 18 Jun 2026 14:00:05 GMT
Status: Translation complete
Last updated in system: 2026-06-19 13:55:51.711069
Title: STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training
Title (reference translation): STAR: Spatiotemporal adaptive reward allocation for text-to-image RL post-training
Authors: Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan,
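Abstract summary: SpatioTemporal Adaptive Reward (STAR) Allocation is a method for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model, starting from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that vary dynamically across denoising steps and rollouts, and allocates the same group-relative advantage to the more relevant latent regions.
Reference score (interest level computed by the site): 11.804446262558175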
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.
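Abstract (reference translation): Existing RL post-training methods for text-to-image generation typically turn the final-image reward into a single scalar advantage and apply it with equal strength across the whole generative trajectory. Text-to-image generation, however, has temporal and spatial structure: different denoising steps handle different stages of generation, and the content that actually determines text alignment often occupies only part of the image. This granularity mismatch makes it hard for policy updates to focus on the generative components that actually affect the reward. To address this, the authors propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model, starting from the core content the user truly cares about in the prompt. It builds spatial allocation maps that vary dynamically across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to those regions through a spatially resolved policy objective. With Stable Diffusion 3.5 Medium as the base model, the method is evaluated on three tasks: GenEval, OCR text rendering, and PickScore. The experiments show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, reaching 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.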
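
The allocation idea in the abstract (one group-relative advantage per rollout, spread unevenly over latent regions according to an attention-derived map) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the formulas, so the standardization of rewards, the `floor` parameter, the mean-1 normalization, and all function names below are assumptions chosen only for illustration. The sketch covers a single denoising step and takes the attention maps as given; in the paper the maps also vary across steps.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: a mean/std standardization, as in common group-relative RL setups.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def allocation_map(attention, floor=0.5, eps=1e-8):
    """Turn a non-negative 2D relevance map into weights that average to 1.

    `floor` is a hypothetical parameter that keeps low-attention regions
    from receiving zero update strength.
    """
    cells = [a for row in attention for a in row]
    mean_att = fmean(cells)
    raw = [[floor + a / (mean_att + eps) for a in row] for row in attention]
    mean_raw = fmean([w for row in raw for w in row])
    return [[w / mean_raw for w in row] for row in raw]


def spatial_advantages(rewards, attention_maps):
    """Scale each rollout's scalar advantage by its own allocation map."""
    advantages = group_relative_advantages(rewards)
    return [
        [[adv * w for w in row] for row in allocation_map(att)]
        for adv, att in zip(advantages, attention_maps)
    ]


# Two rollouts for one prompt, each with a 2x2 latent grid (toy values).
rewards = [1.0, 0.0]
attention_maps = [
    [[0.0, 1.0], [0.0, 1.0]],  # relevant content on the right side
    [[1.0, 1.0], [1.0, 1.0]],  # uniform relevance
]
per_region = spatial_advantages(rewards, attention_maps)
```

Because each map is normalized to average 1, the mean of a rollout's per-region advantages equals its scalar advantage, so the overall update strength is unchanged and only its spatial distribution shifts toward high-attention regions. That property is a design choice of this sketch, not something the abstract states.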
