MiMo-VL Technical Report
- URL: http://arxiv.org/abs/2506.03569v1
- Date: Wed, 04 Jun 2025 04:32:54 GMT
- Title: MiMo-VL Technical Report
- Authors: Xiaomi LLM-Core Team: Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia
- Abstract summary: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G.
- Score: 73.47820531501678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
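The abstract describes Mixed On-policy Reinforcement Learning (MORL) only at a high level, so the following is a minimal, hypothetical sketch of how several heterogeneous reward signals (for example a rule-based answer verifier and a format check) might be combined into a single scalar reward for an on-policy rollout. All names, weights, and the Rollout structure below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: mixing diverse reward signals for one on-policy rollout.
# Reward sources, weights, and data structures are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rollout:
    prompt: str
    response: str
    domain: str  # e.g. "math", "gui_grounding", "general"


# Each reward source scores a rollout in [0, 1].
RewardFn = Callable[[Rollout], float]


def mixed_reward(rollout: Rollout,
                 reward_fns: Dict[str, RewardFn],
                 weights: Dict[str, float]) -> float:
    """Weighted sum of heterogeneous reward signals for a single rollout."""
    total = 0.0
    for name, fn in reward_fns.items():
        total += weights.get(name, 1.0) * fn(rollout)
    return total


# Illustrative reward sources (stand-ins for rule-based verifiers / reward models).
def exact_match_reward(r: Rollout) -> float:
    return 1.0 if "42" in r.response else 0.0  # e.g. a verifiable math answer


def format_reward(r: Rollout) -> float:
    return 1.0 if r.response.strip().endswith(".") else 0.5


rollouts: List[Rollout] = [Rollout("What is 6 * 7?", "The answer is 42.", "math")]
fns = {"accuracy": exact_match_reward, "format": format_reward}
w = {"accuracy": 1.0, "format": 0.2}
print([mixed_reward(r, fns, w) for r in rollouts])
```

In practice the mixed scalar reward would feed a standard on-policy policy-gradient update; the sketch only illustrates the signal-mixing step.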
Related papers
- GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning [112.51671310005604]
We present GLM-4.1V-9B-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. We propose Reinforcement Learning with Curriculum Sampling to unlock the full potential of the model. Open-source GLM-4.1V-9B-Thinking achieves state-of-the-art performance among models of comparable size.
arXiv Detail & Related papers (2025-07-01T17:55:04Z)
- MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining [66.10635181116766]
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini.
arXiv Detail & Related papers (2025-05-12T14:30:11Z)
- SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [100.85923086072204]
We introduce ThinkLite-VL, a family of visual reasoning models that achieve state-of-the-art (SoTA) performance using an order of magnitude fewer training samples. We use Monte Carlo Tree Search (MCTS) to measure sample difficulty via the number of reasoning iterations a vision-language model (VLM) requires to solve each instance (a minimal sketch of this difficulty-based selection idea follows the related-papers list below). ThinkLite-VL-7B and ThinkLite-VL-72B significantly outperform their respective base models across eight visual reasoning benchmarks.
arXiv Detail & Related papers (2025-04-10T17:49:05Z)
- Kimi-VL Technical Report [88.78957513757784]
Kimi-VL is a vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities. As a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking.
arXiv Detail & Related papers (2025-04-10T06:48:26Z)
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning [76.82159851648711]
We propose a framework that dynamically improves the embedding model's representation learning for negative pairs (a hardness-weighted contrastive loss in this spirit is sketched after the related-papers list below). LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance. LLaVE can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance.
arXiv Detail & Related papers (2025-03-04T10:21:57Z)
- MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting [0.6675160100853794]
We curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems.
For generating answers to questions with multimodal input, we employed zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), with the LLaVA models fine-tuned on our dataset.
For evaluating LLMs that take only textual input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7B models.
arXiv Detail & Related papers (2024-04-11T07:11:47Z)
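For the MCTS-guided sample selection entry above (ThinkLite-VL), the following is a hedged sketch of difficulty-based data selection, where difficulty is proxied by the number of attempts a solver needs before it succeeds. The stub solver, the iteration budget, and the selection band are assumptions for illustration, not the paper's settings.

```python
# Hedged sketch: keep training samples whose difficulty (attempts to solve)
# falls in an informative band. The solver below is a random stub standing in
# for a VLM guided by MCTS-style search; all thresholds are assumptions.
import random
from typing import Callable, List, Tuple


def iterations_to_solve(sample: str,
                        try_solve: Callable[[str], bool],
                        max_iters: int = 16) -> int:
    """Count attempts until the solver succeeds; unsolved samples get the max bucket."""
    for i in range(1, max_iters + 1):
        if try_solve(sample):
            return i
    return max_iters + 1


def select_informative(samples: List[str],
                       try_solve: Callable[[str], bool],
                       band: Tuple[int, int] = (3, 12)) -> List[str]:
    """Keep samples that are neither trivial nor hopeless for the current model."""
    lo, hi = band
    return [s for s in samples if lo <= iterations_to_solve(s, try_solve) <= hi]


# Stub solver: succeeds with 30% probability per attempt.
stub = lambda s: random.random() < 0.3
print(select_informative([f"problem-{i}" for i in range(10)], stub))
```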
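For the LLaVE entry above, here is a minimal sketch of a hardness-weighted InfoNCE-style contrastive loss in which harder in-batch negatives (those more similar to the query) receive larger weight. The exponential weighting scheme and the hyperparameters are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: hardness-weighted contrastive (InfoNCE-style) loss.
# Off-diagonal (negative) pairs are up-weighted by exp(alpha * similarity).
import torch
import torch.nn.functional as F


def hardness_weighted_infonce(q: torch.Tensor,   # (B, D) query embeddings
                              k: torch.Tensor,   # (B, D) positive embeddings
                              tau: float = 0.07,
                              alpha: float = 1.0) -> torch.Tensor:
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    sim = q @ k.t() / tau                        # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)

    # Hardness weights: harder negatives (higher sim) get larger weight;
    # positives on the diagonal keep weight 1.
    with torch.no_grad():
        w = torch.exp(alpha * sim)
        w.fill_diagonal_(1.0)

    # Adding log(w) to the logits multiplies each exp(sim) term by w
    # inside the softmax, i.e. a weighted contrastive objective.
    logits = sim + torch.log(w)
    return F.cross_entropy(logits, labels)


# Toy usage with random embeddings.
q, k = torch.randn(8, 32), torch.randn(8, 32)
print(hardness_weighted_infonce(q, k).item())
```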