Fugu-MT Paper Translation (Abstract): SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Paper overview: SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

arxiv url: http://arxiv.org/abs/2605.12500v1
Date: Tue, 12 May 2026 17:59:58 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:57.088898
Title: SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Title (reference translation): SenseNova-U1: Integrating Multimodal Understanding and Generation with the NEO Unified Architecture
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin
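Abstract summary: We introduce SenseNova-U1, a native unified multimodal paradigm built on NEO-unify. We describe two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT. We also present the model design, data preprocessing, pre-/post-training, and inference strategies to support community research.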
Reference score (independently calculated attention level): 110.20462227888915
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
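Abstract (reference translation): Recent large vision-language models (VLMs) remain constrained by a persistent dichotomy: understanding and generation are treated as separate problems, which leads to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact but a structural limitation that hinders the emergence of native multimodal intelligence. We therefore introduce SenseNova-U1, a native unified multimodal paradigm built on NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We release two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. At the same time, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we present detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Finally, preliminary evidence shows that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap in which models do not translate between modalities but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.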


Related papers