Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
- URL: http://arxiv.org/abs/2507.12566v1
- Date: Wed, 16 Jul 2025 18:31:23 GMT
- Title: Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
- Authors: Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
- Abstract summary: This paper focuses on monolithic Multimodal Large Language Models (MLLMs). Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning.
- Score: 70.59376970630387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
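The core idea in the abstract (a new visual parameter space embedded in a pre-trained LLM and trained via delta tuning) can be illustrated with a toy sketch. This is not the authors' implementation: the names `LinearExpert` and `modality_routed_ffn`, the hard per-token routing by modality, and the single linear layer per expert are all assumptions made for illustration. It only shows the general pattern of keeping the original language weights frozen while newly added visual experts receive updates.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[Vector]


@dataclass
class LinearExpert:
    """Hypothetical toy expert computing y = W x (stand-in for an FFN expert)."""
    weight: Matrix
    trainable: bool  # False models the frozen pre-trained LLM parameters

    def forward(self, x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]

    def sgd_step(self, grad: Matrix, lr: float) -> None:
        # Delta tuning in this sketch: frozen experts silently skip updates,
        # so only the newly added (visual) parameters change.
        if not self.trainable:
            return
        self.weight = [
            [w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(self.weight, grad)
        ]


def modality_routed_ffn(
    tokens: List[Vector],
    is_visual: List[bool],
    text_expert: LinearExpert,
    visual_expert: LinearExpert,
) -> List[Vector]:
    """Send each token to the expert matching its modality (assumed hard routing)."""
    return [
        (visual_expert if visual else text_expert).forward(token)
        for token, visual in zip(tokens, is_visual)
    ]


# Usage: one text token and one visual token pass through the same layer.
text_expert = LinearExpert(weight=[[1.0, 0.0], [0.0, 1.0]], trainable=False)
visual_expert = LinearExpert(weight=[[2.0, 0.0], [0.0, 2.0]], trainable=True)

outputs = modality_routed_ffn(
    tokens=[[1.0, 2.0], [3.0, 4.0]],
    is_visual=[False, True],
    text_expert=text_expert,
    visual_expert=visual_expert,
)

# A made-up gradient, applied to both experts; only the visual one moves.
fake_grad = [[1.0, 0.0], [0.0, 1.0]]
text_expert.sgd_step(fake_grad, lr=0.5)
visual_expert.sgd_step(fake_grad, lr=0.5)
```

Because the text expert never changes, the language behaviour of the base model is preserved by construction in this sketch, which is the intuition behind using such a scheme to limit catastrophic forgetting.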
Related papers
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models [139.19991097260115]
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
arXiv Detail & Related papers (2025-04-14T17:59:25Z)
- LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models [18.489240454283834]
We introduce Language-Guided Vision Token Pruning (LVPruning) for MLLMs. LVPruning employs cross-attention modules to compute the importance of vision tokens based on their interaction with language tokens. Experiments show that LVPruning can effectively reduce up to 90% of vision tokens by the middle layer of LLaVA-1.5.
arXiv Detail & Related papers (2025-01-23T13:31:51Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [63.27511432647797]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z)
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [48.455597568212944]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data.
arXiv Detail & Related papers (2024-10-10T17:59:22Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.