Fugu-MT Paper Translation (Abstract): InterleaveThinker: Reinforcing Agentic Interleaved Generation

Paper overview: InterleaveThinker: Reinforcing Agentic Interleaved Generation

arxiv url: http://arxiv.org/abs/2606.13679v2
Date: Fri, 12 Jun 2026 06:34:25 GMT
Status: Translation complete
Last updated in system: 2026-06-15 13:53:03.78753
Title: InterleaveThinker: Reinforcing Agentic Interleaved Generation
Title (reference translation): InterleaveThinker: Reinforcing Agentic Interleaved Generation
Authors: Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li
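Abstract summary: We introduce InterleaveThinker, the first multi-agent pipeline designed to endow existing image generators with interleaved generation capabilities. Specifically, a planner agent organizes the image-text input sequence and instructs the image generator on the required execution at each step. A critic agent then evaluates the generator's outputs, identifies samples that deviate from the planned instructions, and refines the instructions for regeneration.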
Reference score (independently computed attention score): 37.528182608182554
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
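Abstract (reference translation): Recent image generators show impressive photorealism and instruction-following ability in single-image generation and editing. Constrained by their architectures, however, they cannot perform interleaved generation (text-image sequences), which has important applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) show limited performance in this respect. This paper introduces InterleaveThinker, the first multi-agent pipeline designed to give any existing image generator interleaved generation capabilities. Specifically, a planner agent organizes the image-text input sequence and instructs the image generator on the required execution at each step. A critic agent then evaluates the generator's outputs, identifies samples that deviate from the planned instructions, and refines the instructions for regeneration. To implement this pipeline, the authors construct Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. They then develop Interleave-Critic-RL-13k to reinforce step-wise instruction correction within a generation trajectory using GRPO. Because a single interleaved generation trajectory may involve more than 25 generator calls, optimizing the entire trajectory is computationally impractical. They therefore propose an accuracy reward and a step-wise reward so that single-step RL can effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. It also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, substantial gains are observed on WISE and RISE.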
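
The planner, generator, and critic loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `Verdict`, `run_interleaved`, and `max_retries`, along with the toy agents, are hypothetical stand-ins, and the abstract does not specify how many regeneration attempts are allowed or how the critic phrases its refinements.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    instruction: str  # what the image generator is asked to produce at this step
    text: str         # text segment that accompanies the image in the output


@dataclass
class Verdict:
    ok: bool                  # True if the image follows the planned instruction
    refined_instruction: str  # instruction to retry with when ok is False


def run_interleaved(
    prompt: str,
    planner: Callable[[str], List[Step]],
    generator: Callable[[str], str],
    critic: Callable[[str, str], Verdict],
    max_retries: int = 2,  # assumed retry budget, not taken from the paper
) -> List[Tuple[str, str]]:
    """Plan the sequence, generate each image, and let the critic request retries."""
    outputs = []
    for step in planner(prompt):
        image = generator(step.instruction)
        for _ in range(max_retries):
            # The critic always compares the image against the planned instruction.
            verdict = critic(step.instruction, image)
            if verdict.ok:
                break
            image = generator(verdict.refined_instruction)
        # If the retry budget runs out, the last image is kept unchecked.
        outputs.append((step.text, image))
    return outputs


# Toy agents (hypothetical) so that the sketch runs without any model.
def toy_planner(prompt: str) -> List[Step]:
    return [Step(f"{prompt}: scene {i}", f"Caption {i}") for i in (1, 2)]


def toy_generator(instruction: str) -> str:
    return f"<image for '{instruction}'>"


def toy_critic(planned: str, image: str) -> Verdict:
    # Accept only images produced from a "detailed" instruction.
    return Verdict("detailed" in image, planned + ", detailed")


result = run_interleaved("a fox story", toy_planner, toy_generator, toy_critic)
for text, image in result:
    print(text, image)
```

In this toy run each step is rejected once and regenerated from the refined instruction, so every step costs two generator calls. That multiplication of calls per step is the reason the abstract gives for optimizing single steps rather than whole trajectories.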


Related papers