Fugu-MT Paper Translation (Abstract): Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Paper overview: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

arxiv url: http://arxiv.org/abs/2605.04128v1
Date: Tue, 05 May 2026 15:49:47 GMT
Status: Translation complete
Last updated in system: 2026-05-07 18:41:07.45464
Title: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
Title (reference translation): Awakening Spatial Intelligence in Unified Multimodal Understanding and Generation
Authors: Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan
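Abstract summary: We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. We build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance.
Reference score (independently computed attention metric): 68.03746493619285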
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.
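Abstract (reference translation): We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT) so that perception and generation can interact through a shared multimodal interface. Around this architecture, a scalable training recipe is built that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning lets the model move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.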
