Emerging Properties in Unified Multimodal Pretraining
- URL: http://arxiv.org/abs/2505.14683v3
- Date: Sun, 27 Jul 2025 11:45:16 GMT
- Title: Emerging Properties in Unified Multimodal Pretraining
- Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
- Abstract summary: We introduce BAGEL, an open-source foundational model that supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. It significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks.
- Score: 32.856334401494145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/
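The abstract describes BAGEL as a single decoder-only model trained on interleaved text, image, video, and web tokens. As a rough illustration of what interleaved multimodal data means in practice, the sketch below flattens alternating text and image segments into one autoregressive token stream; this is a minimal sketch under assumed conventions, not BAGEL's released tokenizer or API, and the marker strings, class, and function names are hypothetical.

```python
# Illustrative only: packing interleaved text/image segments into a single
# autoregressive stream for a decoder-only model. Markers and helper names
# are hypothetical, not BAGEL's actual implementation.
from dataclasses import dataclass
from typing import Callable, List, Union

BOI, EOI = "<img>", "</img>"  # assumed begin/end-of-image markers


@dataclass
class ImagePatchTokens:
    """Visual tokens produced by some image tokenizer (assumed to exist)."""
    tokens: List[int]


def pack_interleaved(
    segments: List[Union[str, ImagePatchTokens]],
    text_tokenize: Callable[[str], List[str]],
) -> List[Union[int, str]]:
    """Flatten alternating text/image segments into one token stream."""
    stream: List[Union[int, str]] = []
    for seg in segments:
        if isinstance(seg, ImagePatchTokens):
            stream.append(BOI)
            stream.extend(seg.tokens)   # image tokens sit inline with the text
            stream.append(EOI)
        else:
            stream.extend(text_tokenize(seg))
    return stream


# Toy usage with a whitespace "tokenizer":
toy_tokenize = lambda s: s.split()
sequence = pack_interleaved(
    ["A photo of a cat:", ImagePatchTokens([101, 102, 103]), "Now brighten the sky."],
    toy_tokenize,
)
print(sequence)
```

In a real pipeline the visual tokens would come from an image tokenizer or VAE encoder, and the resulting stream would be trained with a standard causal next-token objective.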
Related papers
- HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation [69.34266162474836]
This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. We introduce feature pre-scaling and multimodal AdaLN techniques to address cross-modal compatibility challenges (a minimal AdaLN sketch follows this list). We present HaploOmni, a new single multimodal transformer.
arXiv Detail & Related papers (2025-06-03T15:14:00Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens. MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- Multi-modal Generative AI: Multi-modal LLMs, Diffusions and the Unification [41.88402339122694]
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. This paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation.
arXiv Detail & Related papers (2024-09-23T13:16:09Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation [27.773146599559286]
Anole is an open, autoregressive, native large multimodal model for interleaved image-text generation.
We have open-sourced our model, training framework, and instruction tuning data.
arXiv Detail & Related papers (2024-07-08T17:08:02Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X. SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
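The HaploOmni entry above mentions multimodal AdaLN for cross-modal compatibility. The sketch below shows the generic adaptive layer normalization pattern (a scale and shift predicted from a conditioning vector modulate the normalized features); it is an assumption-level illustration, not HaploOmni's exact formulation, and the class and variable names are hypothetical.

```python
# Minimal AdaLN sketch (PyTorch): a conditioning vector predicts a scale and
# shift that modulate LayerNorm output. Illustrative only; not HaploOmni's
# actual "multimodal AdaLN".
import torch
import torch.nn as nn


class AdaLN(nn.Module):
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        # LayerNorm without learned affine parameters; modulation comes from cond.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim); cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Toy usage: modulate token features by a condition pooled from another modality.
x = torch.randn(2, 16, 256)
cond = torch.randn(2, 64)
print(AdaLN(256, 64)(x, cond).shape)  # torch.Size([2, 16, 256])
```

In a multimodal setting the conditioning vector would typically be pooled from the other modality's features, letting one modality adapt the normalization of the other.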
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.