Fugu-MT Paper Translation (Abstract): FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Paper overview: FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching

arxiv url: http://arxiv.org/abs/2604.06757v1
Date: Wed, 08 Apr 2026 07:22:08 GMT
Status: Translation complete
System update date: 2026-04-09 17:30:51.394066
Title: FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching
Authors: Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang
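Abstract summary: FlowInOne is a framework that reformulates multimodal generation as a purely visual flow. It unifies text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. It achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems.
Reference score (independently computed attention score): 86.31254356971506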
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.
