Fugu-MT Paper Translation (Abstract): Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Paper abstract: Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

arxiv url: http://arxiv.org/abs/2604.04746v3
Date: Wed, 08 Apr 2026 01:34:51 GMT
Status: Translation complete
Last updated in system: 2026-04-09 14:06:05.075489
Title: Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
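Title (reference translation): Think of Strokes - Process-Driven Image Generation via Interleaved Reasoning
Authors: Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He
Abstract summary: Process-driven image generation is a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory. Its core challenge stems from the ambiguity of intermediate states, which the paper addresses through dense, step-wise supervision that maintains two complementary constraints.
Reference score (independently computed attention score): 59.262311672150055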
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially-complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.
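Abstract (reference translation): Humans plan a global layout, sketch a coarse draft, inspect it, and refine the details. However, can unified multimodal models trained on text-image interleaved datasets imagine the chain of intermediate states? This paper proposes a multi-step, process-driven image generation method. Rather than generating an image in a single step, the approach unfolds across multiple iterations, each consisting of four stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly specifies how the visual state should evolve, and the generated visual intermediate constrains and grounds the next round of textual reasoning. The core challenge of process-driven generation stems from the ambiguity of intermediate states. For the visual intermediate states, spatial and semantic consistency is enforced; for the textual intermediate states, prior visual knowledge is preserved while the model is enabled to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, experiments are conducted on various text-to-image generation benchmarks.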
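
Illustrative sketch (not from the paper): the four-stage iteration described in the abstract can be pictured as a loop that alternates text and image steps, with each step conditioned on the current visual state. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function name `process_driven_generate`, the `Step` record, the callable signatures, and the use of plain strings as stand-ins for both textual thoughts and image states.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    stage: str     # "plan", "draft", "reflect", or "refine"
    modality: str  # "text" or "image"
    content: str   # a string stands in for both text and image states


def process_driven_generate(
    prompt: str,
    plan: Callable[[str, str], str],
    draft: Callable[[str, str], str],
    reflect: Callable[[str, str], str],
    refine: Callable[[str, str], str],
    num_iterations: int = 2,
) -> List[Step]:
    """Run the four-stage loop and return the interleaved trajectory."""
    trajectory: List[Step] = []
    canvas = ""  # hypothetical visual state; empty before the first draft
    for _ in range(num_iterations):
        # Textual planning, grounded in the current visual state.
        thought = plan(prompt, canvas)
        trajectory.append(Step("plan", "text", thought))
        # Visual drafting, conditioned on the plan.
        canvas = draft(thought, canvas)
        trajectory.append(Step("draft", "image", canvas))
        # Textual reflection: compare the draft against the prompt.
        critique = reflect(prompt, canvas)
        trajectory.append(Step("reflect", "text", critique))
        # Visual refinement, conditioned on the critique.
        canvas = refine(critique, canvas)
        trajectory.append(Step("refine", "image", canvas))
    return trajectory


# Toy stand-ins for the model calls (purely hypothetical).
def toy_plan(prompt: str, canvas: str) -> str:
    return f"plan for '{prompt}' given [{canvas}]"


def toy_draft(thought: str, canvas: str) -> str:
    return canvas + "d"  # "d" marks one drafting pass


def toy_reflect(prompt: str, canvas: str) -> str:
    return f"check [{canvas}] against '{prompt}'"


def toy_refine(critique: str, canvas: str) -> str:
    return canvas + "r"  # "r" marks one refinement pass


trajectory = process_driven_generate(
    "a red cube on a table", toy_plan, toy_draft, toy_reflect, toy_refine
)
for step in trajectory:
    print(step.stage, step.modality, step.content)
```

In a real unified multimodal model the four callables would presumably be a single network emitting text and image tokens in turn; they are kept separate here only to make the interleaving visible.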

Related papers