Xiaomi MiMo-VL-Miloco Technical Report
- URL: http://arxiv.org/abs/2512.17436v2
- Date: Mon, 22 Dec 2025 13:27:24 GMT
- Title: Xiaomi MiMo-VL-Miloco Technical Report
- Authors: Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, Zhenbo Luo, Jian Luan,
- Abstract summary: We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments.
- Score: 17.03705921238102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
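As a minimal illustration of the reinforcement-learning stage mentioned in the abstract, the sketch below shows the group-relative advantage computation at the core of Group Relative Policy Optimization (GRPO): rewards for a group of sampled responses to the same prompt are standardized against the group mean and standard deviation instead of a learned value baseline. The helper name and example rewards are hypothetical and do not come from the authors' released code.

```python
# Minimal GRPO-style advantage sketch (assumed details, not the authors' implementation).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    GRPO dispenses with a learned value/critic baseline: each sampled
    response's advantage is its reward minus the group mean, divided by
    the group standard deviation.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards for four sampled answers to one home-scenario prompt,
# e.g. task correctness combined with a token-budget penalty (assumed reward design).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```

These per-response advantages would then weight a clipped, PPO-style policy-gradient objective over the sampled group.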
Related papers
- MiMo-Embodied: X-Embodied Foundation Model Technical Report [53.335119478104644]
We open-source MiMo-Embodied, the first cross-embodied foundation model to achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines.
arXiv Detail & Related papers (2025-11-20T16:34:55Z) - MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs [61.70050081221131]
MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics.
arXiv Detail & Related papers (2025-11-10T16:02:33Z) - NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints [100.02131897927484]
This paper focuses on the native training of Multimodal Large Language Models (MLLMs) in an end-to-end manner. We propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs.
arXiv Detail & Related papers (2025-10-09T17:59:37Z) - MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs [20.842336447426682]
MindVL is a multimodal large language model trained end-to-end on Ascend NPUs. We introduce MindSpeed-MLLM, a highly efficient training framework that supports stable and high-performance training. We find that averaging weights from checkpoints trained with different sequence lengths is particularly effective.
arXiv Detail & Related papers (2025-09-15T08:00:31Z) - MiMo-VL Technical Report [73.47820531501678]
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G.
arXiv Detail & Related papers (2025-06-04T04:32:54Z) - SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant? [34.225988628142225]
We introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization.
arXiv Detail & Related papers (2025-03-08T03:02:21Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected
Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z) - MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile
Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the 1.4B and 2.7B parameter scale trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)