FuguReport

Toward Native Multimodal Modeling: A Roadmap

Authors: Siyu An, Junru Lu, Junnan Dong, Qiufeng Wang, Yinghui Li, Weizhi Fei, Zichao Yu, Zheng Yuan, Biao Liu, Haopeng Wang, Renzhao Liang, Yixuan Yang, Yunhang Shen, Bo Ke, Keyu Chen, Linhao Luo, Difan Zou, Xiao Huang, Di Yin, Ruizhi Qiao, Xing Sun
Affiliations: Tencent / The University of Hong Kong / Monash University / Tsinghua University / The Hong Kong Polytechnic University / University of Warwick
Categories: Method / Multimodal Modeling / Native multimodal model design space; Task / Reasoning / Modality-independent inference; Research / Modeling Frameworks / Roadmap for native multimodal architectures
License: CC BY 4.0

Abstract Overview

This paper is a roadmap and survey of native multimodal modeling (NMM), framed as a transition from modular late-fusion systems toward architectures that integrate modalities intrinsically. It formalizes what counts as "native" by separating mid-fusion and early-fusion regimes from late fusion, and further organizes models by input-output symmetry into Multi-to-Text, Multi-to-Target, and Multi-to-Multi categories. Beyond architecture, the paper reviews the full NMM pipeline: datasets, training, inference and deployment, evaluation, and future research directions.

Novelty

The paper's main novelty is a formalized design framework and structural taxonomy for native multimodal modeling. It proposes explicit definitions of architectural nativity based on fusion depth and input-output modality symmetry, structuring a previously fragmented design space.

Results

The primary outcome is a comprehensive roadmap that systematizes native multimodal models, their technical bottlenecks, and corresponding solution patterns across architecture, data, training, inference, and evaluation. It contributes a structured synthesis of the field and a forward-looking agenda toward unified multi-to-multi multimodal systems.

Key Points

  1. The paper defines native multimodal modeling by separating mid-fusion and early-fusion architectures from non-native late-fusion approaches.
  2. It categorizes native systems into Multi-to-Text, Multi-to-Target, and Multi-to-Multi paradigms to describe different input-output modality flows.
  3. It surveys the end-to-end NMM stack, covering representative models, dataset types, training recipes, inference challenges, and future trajectories.
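
The two taxonomy axes in the key points above (fusion depth and input-output modality flow) can be sketched as a small data structure. This is an illustrative sketch only, not the paper's formalization. The rule that mid-fusion and early-fusion count as native comes from the summary; everything else is an assumption made here for illustration: the class and function names, the representation of modalities as strings, and in particular the reading of Multi-to-Target as "a single non-text output modality" and Multi-to-Multi as "more than one output modality".

```python
from dataclasses import dataclass
from enum import Enum


class FusionDepth(Enum):
    LATE = "late-fusion"
    MID = "mid-fusion"
    EARLY = "early-fusion"


class IOParadigm(Enum):
    MULTI_TO_TEXT = "Multi-to-Text"
    MULTI_TO_TARGET = "Multi-to-Target"
    MULTI_TO_MULTI = "Multi-to-Multi"


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical description of a model along the two taxonomy axes."""
    name: str
    fusion: FusionDepth
    inputs: frozenset   # input modalities, e.g. {"text", "image"}
    outputs: frozenset  # output modalities, e.g. {"text"}


def is_native(profile: ModelProfile) -> bool:
    # From the summary: mid-fusion and early-fusion are native,
    # late-fusion is not.
    return profile.fusion in (FusionDepth.MID, FusionDepth.EARLY)


def io_paradigm(profile: ModelProfile) -> IOParadigm:
    # ASSUMPTION: these rules are one plausible reading of the three
    # category names, not definitions taken from the paper.
    if len(profile.inputs) < 2:
        raise ValueError("expected at least two input modalities")
    if not profile.outputs:
        raise ValueError("expected at least one output modality")
    if profile.outputs == {"text"}:
        return IOParadigm.MULTI_TO_TEXT
    if len(profile.outputs) == 1:
        return IOParadigm.MULTI_TO_TARGET
    return IOParadigm.MULTI_TO_MULTI


# Hypothetical example profiles (not real models from the survey).
examples = [
    ModelProfile("example-a", FusionDepth.LATE,
                 frozenset({"text", "image"}), frozenset({"text"})),
    ModelProfile("example-b", FusionDepth.EARLY,
                 frozenset({"text", "image"}), frozenset({"image"})),
    ModelProfile("example-c", FusionDepth.MID,
                 frozenset({"text", "image", "audio"}),
                 frozenset({"text", "image", "audio"})),
]

for p in examples:
    print(p.name, is_native(p), io_paradigm(p).value)
```

Keeping the two axes as independent fields reflects that the summary treats them separately: fusion depth decides whether a model is native at all, while the input-output paradigm describes how modalities flow through it.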
