Fugu-MT Paper Translation (Abstract): Toward Native Multimodal Modeling: A Roadmap

Paper summary: Toward Native Multimodal Modeling: A Roadmap

arxiv url: http://arxiv.org/abs/2605.25343v1
Date: Mon, 25 May 2026 01:57:43 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:19.236855
Title: Toward Native Multimodal Modeling: A Roadmap
Title (reference translation): The Road to Native Multimodal Modeling
Authors: Siyu An, Junru Lu, Junnan Dong, Qiufeng Wang, Yinghui Li, Weizhi Fei, Zichao Yu, Zheng Yuan, Biao Liu, Haopeng Wang, Renzhao Liang, Yixuan Yang, Yunhang Shen, Bo Ke, Keyu Chen, Linhao Luo, Difan Zou, Xiao Huang, Di Yin, Ruizhi Qiao, Xing Sun
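Abstract summary: Multimodal modeling is an important step from modality-agnostic reasoning toward world modeling. Recent efforts have shifted the paradigm toward native multimodal modeling. Despite its potential, the design space of native architectures remains insufficiently defined.
Reference score (independently computed attention score): 73.2994129763275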
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late-fusion that assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM) with the intrinsic integration of modalities for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define the architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio and video generation, and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from industrial perspectives from architectural coordination, massive data curation, to full-stack training recipes, inference & deployment, and the comprehensive evaluation for truly native modeling.
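Abstract (reference translation): Multimodal modeling is an important step from modality-agnostic reasoning toward world modeling. Early approaches relied heavily on late fusion, which assembles encoders and frozen language backbones with output heads, whereas recent efforts have shifted the paradigm toward native multimodal modeling (NMM), in which modalities are intrinsically integrated for better multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. This paper presents the community with a formal roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text, for cross-modal comprehension with text-only output; (ii) Multi-to-Target, for scenario-oriented generation such as image, audio, and video generation; and (iii) Multi-to-Multi, for unified modeling with symmetric input and output. We provide a comprehensive, industrial-grade investigation of the transition toward a definitive NMM framework in which understanding and generation coexist seamlessly within a unified transformer paradigm. From an industrial perspective, we systematically unpack the end-to-end pipeline, covering architectural coordination, large-scale data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.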