Toward Native Multimodal Modeling: A Roadmap
Abstract Overview
This paper is a roadmap and survey of native multimodal modeling (NMM), framed as a transition from modular late-fusion systems toward architectures in which modalities are integrated intrinsically. It formalizes what counts as "native" by distinguishing mid-fusion and early-fusion regimes and further organizes models by input-output symmetry into Multi-to-Text, Multi-to-Target, and Multi-to-Multi categories. Beyond architecture, the paper reviews the full NMM pipeline, including datasets, training, inference and deployment, evaluation, and future research directions.
Novelty
The paper's main novelty is a formalized design framework and structural taxonomy for native multimodal modeling. It proposes explicit definitions of architectural nativity based on fusion depth and input-output modality symmetry, structuring a previously fragmented design space.
Results
The primary outcome is a comprehensive roadmap that systematizes native multimodal models, their technical bottlenecks, and corresponding solution patterns across architecture, data, training, inference, and evaluation. It contributes a structured synthesis of the field and a forward-looking agenda toward unified multi-to-multi multimodal systems.
Key Points
- The paper defines native multimodal modeling by separating mid-fusion and early-fusion architectures from non-native late-fusion approaches.
- It categorizes native systems into Multi-to-Text, Multi-to-Target, and Multi-to-Multi paradigms to describe different input-output modality flows.
- It surveys the end-to-end NMM stack, covering representative models, dataset types, training recipes, inference and deployment challenges, evaluation, and future research directions.
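The two taxonomy axes described above (fusion depth and input-output symmetry) can be illustrated with a small sketch. This is not code from the paper: the names `Fusion`, `ModelSpec`, `is_native`, and `io_paradigm` are hypothetical, and the rules that separate the three paradigms (text-only output, a single non-text output, several output modalities) are a simplifying assumption made for illustration. The paper's formal definitions may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Fusion(Enum):
    LATE = "late"    # modular systems; treated as non-native in the summary
    MID = "mid"      # native
    EARLY = "early"  # native


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of a model for taxonomy purposes."""
    name: str
    fusion: Fusion
    inputs: frozenset   # e.g. frozenset({"text", "image"})
    outputs: frozenset  # e.g. frozenset({"text"})


def is_native(spec: ModelSpec) -> bool:
    # Per the summary: mid- and early-fusion count as native, late-fusion does not.
    return spec.fusion in (Fusion.MID, Fusion.EARLY)


def io_paradigm(spec: ModelSpec) -> str:
    # ASSUMPTION: these boundary rules are illustrative, not the paper's definitions.
    if len(spec.inputs) < 2:
        raise ValueError("expected at least two input modalities")
    if not spec.outputs:
        raise ValueError("expected at least one output modality")
    if spec.outputs == frozenset({"text"}):
        return "Multi-to-Text"
    if len(spec.outputs) == 1:
        return "Multi-to-Target"
    return "Multi-to-Multi"


# Made-up example models, not systems named in the paper.
captioner = ModelSpec("captioner", Fusion.LATE,
                      frozenset({"text", "image"}), frozenset({"text"}))
unified = ModelSpec("unified", Fusion.EARLY,
                    frozenset({"text", "image", "audio"}),
                    frozenset({"text", "image", "audio"}))

for spec in (captioner, unified):
    print(spec.name, is_native(spec), io_paradigm(spec))
```

Keeping the two axes as separate functions mirrors the summary's framing: nativity depends only on fusion depth, while the paradigm label depends only on which modalities flow in and out.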
References
- arXiv: https://arxiv.org/abs/2605.25343v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.25343v1
- Hugging Face Papers: https://huggingface.co/papers/2605.25343