MiMo-V2-Flash Technical Report
- URL: http://arxiv.org/abs/2601.02780v2
- Date: Thu, 08 Jan 2026 05:52:17 GMT
- Title: MiMo-V2-Flash Technical Report
- Authors: Xiaomi LLM-Core Team, Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang, Peidian Li, Qianli Chen, Shaohui Liu, Shihua Yu, Shijie Cao, Shimao Chen, Shouqiu Yu, Shuo Liu, Tianling Zhou, Weijiang Su, Weikun Wang, Wenhan Ma, Xiangwei Deng, Bohan Mao, Bowen Ye, Can Cai, Chenghua Wang, Chengxuan Zhu, Chong Ma, Chun Chen, Chunan Li, Dawei Zhu, Deshan Xiao, Dong Zhang, Duo Zhang, Fangyue Liu, Feiyu Yang, Fengyuan Shi, Guoan Wang, Hao Tian, Hao Wu, Heng Qu, Hongfei Yi, Hongxu An, Hongyi Guan, Xing Zhang, Yifan Song, Yihan Yan, Yihao Zhao, Yingchun Lai, Yizhao Gao, Yu Cheng, Yuanyuan Tian, Yudong Wang, Zhen Tang, Zhengju Tang, Zhengtao Wen, Zhichao Song, Zhixian Zheng, Zihan Jiang, Jian Wen, Jiarui Sun, Jiawei Li, Jinlong Xue, Jun Xia, Kai Fang, Menghang Zhu, Nuo Chen, Qian Tu, Qihao Zhang, Qiying Wang, Rang Li, Rui Ma, Shaolei Zhang, Shengfan Wang, Shicheng Li, Shuhao Gu, Shuhuai Ren, Sirui Deng, Tao Guo, Tianyang Lu, Weiji Zhuang, Weikang Zhang, Weimin Xiong, Wenshan Huang, Wenyu Yang, Xin Zhang, Xing Yong, Xu Wang, Xueyang Xie, Yilin Jiang, Yixin Yang, Yongzhe He, Yu Tu, Yuanliang Dong, Yuchen Liu, Yue Ma, Yue Yu, Yuxing Xiang, Zhaojun Huang, Zhenru Lin, Zhipeng Xu, Zhiyang Chen, Zhonghua Deng, Zihan Zhang, Zihao Yue
- Abstract summary: We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), using a native 32k context length that is subsequently extended to 256k.
- Score: 101.08860732385551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), using a native 32k context length that is subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense, token-level rewards, enabling the student model to fully absorb each teacher's expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves an acceptance length of up to 3.6 tokens and a 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
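As a rough illustration of the hybrid attention design described in the abstract, the minimal sketch below builds a 5:1 SWA-to-global layer schedule with a 128-token sliding window. Only the window size and the ratio come from the abstract; the layer count, the placement of the global layers, and the helper names are assumptions made for illustration.

```python
WINDOW = 128        # sliding window size (from the abstract)
HYBRID_RATIO = 5    # 5 SWA layers for every global-attention layer (from the abstract)
NUM_LAYERS = 48     # hypothetical depth; the abstract does not state it

def layer_schedule(num_layers: int) -> list:
    """Interleave SWA and global layers at a 5:1 ratio.

    Assumption: every sixth layer uses global attention; the actual
    placement in MiMo-V2-Flash may differ.
    """
    return ["global" if (i + 1) % (HYBRID_RATIO + 1) == 0 else "swa"
            for i in range(num_layers)]

def visible_positions(query_pos: int, kind: str) -> range:
    """Key positions a causal query at `query_pos` may attend to."""
    if kind == "global":
        return range(0, query_pos + 1)                            # full causal prefix
    return range(max(0, query_pos - WINDOW + 1), query_pos + 1)   # last 128 tokens only

if __name__ == "__main__":
    schedule = layer_schedule(NUM_LAYERS)
    print(schedule[:6])                            # ['swa', 'swa', 'swa', 'swa', 'swa', 'global']
    print(len(visible_positions(1000, "swa")))     # 128
    print(len(visible_positions(1000, "global")))  # 1001
```

The point of the interleaving is that global layers see the full causal prefix while the SWA layers attend only to the most recent 128 tokens, so the key-value state for five out of every six layers stays bounded regardless of context length.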
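The MTP layers also serve as the drafter for speculative decoding at inference time. The sketch below shows the generic draft-then-verify loop and how acceptance length is counted; the drafter, verifier, and the 80% agreement rate are toy stand-ins rather than the paper's implementation (a real system would verify all draft tokens with a single batched forward pass of the target model).

```python
import random

K = 3  # number of MTP draft layers (the report uses three)

def speculative_step(prefix, draft_next_tokens, target_sample, target_agrees):
    """Draft K tokens, verify them left to right, and return the accepted span.

    Acceptance length = accepted draft tokens + the one token the target
    model always contributes, so it lies in [1, K + 1].
    """
    drafts = draft_next_tokens(prefix, K)          # K candidate future tokens
    accepted = []
    for tok in drafts:
        if target_agrees(prefix + accepted, tok):  # target model confirms the draft token
            accepted.append(tok)
        else:
            break                                  # first mismatch ends acceptance
    accepted.append(target_sample(prefix + accepted))  # target's own next token
    return accepted

if __name__ == "__main__":
    # Toy drafter/verifier with a hypothetical 80% per-token agreement rate.
    vocab = list(range(100))
    lengths = []
    for _ in range(10_000):
        out = speculative_step(
            prefix=[0],
            draft_next_tokens=lambda p, k: random.choices(vocab, k=k),
            target_sample=lambda p: random.choice(vocab),
            target_agrees=lambda p, t: random.random() < 0.8,
        )
        lengths.append(len(out))
    print(sum(lengths) / len(lengths))  # mean acceptance length; about 2.95 in this toy setup
```

With three draft layers the acceptance length is capped at 4 tokens per step, so the reported 3.6 implies the MTP drafts agree with the main model most of the time, which is what yields the 2.6x decoding speedup.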
Related papers
- LFM2 Technical Report [87.58431408281973]
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities.
The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active).
We build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval.
arXiv Detail & Related papers (2025-11-28T17:56:35Z)
- HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap [17.1806530983927]
We introduce HierMoE to accelerate the training of large language models (LLMs) via two topology-aware techniques.
Our prototype HierMoE achieves $1.55\times$ to $3.32\times$ faster communication and delivers $1.18\times$ to $1.27\times$ faster end-to-end training.
arXiv Detail & Related papers (2025-08-13T08:16:31Z)
- SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation [82.53411922988039]
We introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants.
Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters).
Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models.
arXiv Detail & Related papers (2025-06-23T07:15:59Z)
- 2 OLMo 2 Furious [154.15728448754854]
We present OLMo 2, the next generation of our fully open language models.
OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales.
We describe our modified model architecture and training recipe.
arXiv Detail & Related papers (2024-12-31T21:55:10Z)
- MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM).
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z)
- MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [32.278096820269816]
We present a novel MoESys that boosts efficiency in both large-scale training and inference.
Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
arXiv Detail & Related papers (2022-05-20T09:09:27Z)
- Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference [17.97893143555333]
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation.
In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation.
Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.
arXiv Detail & Related papers (2021-09-24T20:42:16Z)