MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
- URL: http://arxiv.org/abs/2509.18154v1
- Date: Tue, 16 Sep 2025 19:41:48 GMT
- Title: MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
- Authors: Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun,
- Abstract summary: MiniCPM-V 4.5 is an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method. MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size.
- Score: 68.04078852416248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address these challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler model architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters, using just 46.7% of the GPU memory cost and 8.7% of the inference time of Qwen2.5-VL 7B.
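The abstract does not give implementation details of the 3D-Resampler, but the core idea of compressing a group of video frames into a fixed visual-token budget can be illustrated with a perceiver-style resampler. The sketch below is an assumption for illustration only: the class name `Resampler3D`, the query count, and all dimensions are made up, and positional embeddings are omitted; it is not the actual MiniCPM-V 4.5 architecture.

```python
# Minimal sketch of a perceiver-style 3D resampler: a fixed set of learnable
# queries cross-attends over the patch features of several frames at once, so
# a whole frame group is compressed to a constant number of tokens regardless
# of its spatial/temporal size. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class Resampler3D(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable queries: the fixed output token budget per frame group.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, patch_feats):
        # patch_feats: (batch, T*H*W, dim) -- vision-encoder patch features
        # for a group of T frames, flattened over time and space.
        b = patch_feats.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(patch_feats)
        out, _ = self.attn(q, kv, kv)  # (batch, num_queries, dim)
        return out                     # constant-length visual tokens


# Example: 6 frames of 24x24 = 576 patches each (3456 tokens) -> 64 tokens.
feats = torch.randn(1, 6 * 576, 1024)
tokens = Resampler3D()(feats)
print(tokens.shape)  # torch.Size([1, 64, 1024])
```

Because the number of output queries is fixed, the visual token count, and with it GPU memory and inference time on the language-model side, stays constant per frame group no matter how many patches the encoder produces.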
Related papers
- MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs [20.842336447426682]
MindVL is a multimodal large language model trained end-to-end on Ascend NPUs. We introduce MindSpeed-MLLM, a highly efficient training framework that supports stable and high-performance training. We find that averaging weights from checkpoints trained with different sequence lengths is particularly effective (see the checkpoint-averaging sketch after this list).
arXiv Detail & Related papers (2025-09-15T08:00:31Z)
- MiniCPM4: Ultra-Efficient LLMs on End Devices [126.22958722174583]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
arXiv Detail & Related papers (2025-06-09T16:16:50Z)
- EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
- SmolVLM: Redefining small and efficient multimodal models [8.849350918179752]
We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance.
arXiv Detail & Related papers (2025-04-07T17:58:57Z)
- Apollo: An Exploration of Video Understanding in Large Multimodal Models [65.06400672040836]
We present a study that helps uncover what effectively drives video understanding in Large Multimodal Models. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with a 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with 70.9 on MLVU and 63.3 on Video-MME.
arXiv Detail & Related papers (2024-12-13T18:53:24Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [191.7830199016589]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones [18.954681684239358]
This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks.
With a language model of 2.8 billion parameters, TinyGPT-V achieves results in VQA and image inference tasks comparable to those of its larger counterparts.
arXiv Detail & Related papers (2023-12-28T07:11:41Z)
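The MindVL entry above highlights averaging weights from checkpoints trained with different sequence lengths. As a rough illustration of what such checkpoint averaging could look like, the sketch below takes the element-wise mean of the floating-point tensors in several saved state dicts; the helper `average_checkpoints` and the file names are hypothetical and not taken from the paper.

```python
# Minimal sketch of checkpoint weight averaging ("model souping"): average the
# float tensors of several checkpoints that share the same architecture.
# Assumes all checkpoints contain identical state-dict keys.
import torch


def average_checkpoints(paths):
    """Return a state dict whose float tensors are the mean over `paths`."""
    avg = None
    for path in paths:
        sd = torch.load(path, map_location="cpu")
        if avg is None:
            # Non-float tensors (e.g. integer buffers) are kept from the first
            # checkpoint; only floating-point tensors are averaged.
            avg = {k: v.clone().float() if v.is_floating_point() else v.clone()
                   for k, v in sd.items()}
        else:
            for k, v in sd.items():
                if v.is_floating_point():
                    avg[k] += v.float()
    n = len(paths)
    return {k: v / n if v.is_floating_point() else v for k, v in avg.items()}


# Usage (hypothetical checkpoint files, one per training sequence length):
# merged = average_checkpoints(["ckpt_seq2k.pt", "ckpt_seq8k.pt", "ckpt_seq32k.pt"])
# model.load_state_dict(merged)
```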