LongCat-Flash Technical Report
- URL: http://arxiv.org/abs/2509.01322v1
- Date: Mon, 01 Sep 2025 10:05:45 GMT
- Title: LongCat-Flash Technical Report
- Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhua Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su,
- Abstract summary: LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model. It is designed for both computational efficiency and advanced agentic capabilities. We complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens.
- Score: 165.66422346171862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enable dynamic computational budget allocation, activating 18.6B-31.3B parameters (27B on average) per token depending on contextual demands to optimize resource usage; (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window and yields notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy between scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool-use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai; Hugging Face: https://huggingface.co/meituan-longcat; GitHub: https://github.com/meituan-longcat
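To make design (a) concrete, below is a minimal PyTorch sketch of a mixture-of-experts layer whose expert pool includes parameter-free "zero-computation" (identity) experts alongside ordinary FFN experts, so the number of parameters activated for a token depends on where the router sends it. This is a hedged toy illustration of the idea, not the paper's actual architecture; the class name, layer sizes, expert counts, and top-k value are all invented for this example.

```python
import torch
import torch.nn as nn


class ZeroComputationMoE(nn.Module):
    """Toy MoE layer mixing FFN experts with parameter-free "zero-computation"
    (identity) experts, so per-token compute depends on routing decisions."""

    def __init__(self, d_model=512, d_ff=2048, n_ffn_experts=8,
                 n_zero_experts=4, top_k=2):
        super().__init__()
        self.n_ffn = n_ffn_experts
        self.n_total = n_ffn_experts + n_zero_experts   # zero experts hold no weights
        self.top_k = top_k
        self.router = nn.Linear(d_model, self.n_total, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_ffn_experts)
        )

    def forward(self, x):                                # x: (num_tokens, d_model)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.n_total):
                mask = idx[:, k] == e
                if not mask.any():
                    continue
                w = weights[mask, k].unsqueeze(-1)
                if e < self.n_ffn:                       # real expert: run its FFN
                    out[mask] = out[mask] + w * self.experts[e](x[mask])
                else:                                    # zero-computation expert: identity
                    out[mask] = out[mask] + w * x[mask]
        return out


# Usage: tokens routed to identity experts contribute almost no FLOPs,
# which is what lets the activated parameter count float per token.
layer = ZeroComputationMoE()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)                               # torch.Size([16, 512])
```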
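Design (b), the shortcut-connected MoE, is reported to enlarge the window in which expert-parallel communication can overlap with computation. The sketch below illustrates only that scheduling principle on a single GPU: a stand-in for expert dispatch is issued on a separate CUDA stream so a dense shortcut FFN can compute concurrently. It is not the authors' ScMoE implementation; the tensor sizes, module names, and the use of a plain copy as a proxy for all-to-all dispatch are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# This sketch needs a CUDA device, since the overlap relies on CUDA streams.
assert torch.cuda.is_available(), "CUDA device required for this sketch"
device = torch.device("cuda")

d_model, d_ff, num_tokens = 512, 2048, 4096
x = torch.randn(num_tokens, d_model, device=device)
shortcut_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
).to(device)

comm_stream = torch.cuda.Stream()
comm_stream.wait_stream(torch.cuda.current_stream())   # x must be ready first

with torch.cuda.stream(comm_stream):
    dispatched = x.clone()                              # stand-in for all-to-all dispatch

shortcut = shortcut_ffn(x)                              # dense compute overlaps with the copy

torch.cuda.current_stream().wait_stream(comm_stream)    # join before combining
y = shortcut + dispatched
print(y.shape)
```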
Related papers
- MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z) - Efficient Context Scaling with LongCat ZigZag Attention [39.95366576062524]
LongCat ZigZag Attention (LoZA) is a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases.
arXiv Detail & Related papers (2025-12-30T03:39:04Z) - LongCat-Flash-Omni Technical Report [131.47284063481922]
LongCat-Flash-Omni is an open-source omni-modal model with 560 billion parameters. LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. It achieves low-latency real-time audio-visual interaction.
arXiv Detail & Related papers (2025-10-31T21:58:15Z) - Introducing LongCat-Flash-Thinking: A Technical Report [116.75498493511026]
LongCat-Flash-Thinking is an efficient open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process. LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks.
arXiv Detail & Related papers (2025-09-23T10:25:48Z) - LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling [27.045621004239067]
We propose LAWCAT, a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens.
arXiv Detail & Related papers (2025-09-22T22:43:44Z) - SpikingBrain: Spiking Brain-inspired Large Models [42.41339012839023]
SpikingBrain is a family of brain-inspired models for efficient long-context training and inference. We develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior.
arXiv Detail & Related papers (2025-09-05T17:34:00Z) - Scaling Intelligence: Designing Data Centers for Next-Gen Language Models [0.13332839594069593]
Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demand a fundamental rethinking of data center architecture. Our work provides a comprehensive co-design framework that jointly explores FLOPS, bandwidth, capacity, and multiple network topologies. We quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, widening the scale-out domain, and increasing memory capacity.
arXiv Detail & Related papers (2025-06-17T22:29:37Z) - Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought [196.74837065805488]
Hunyuan-TurboS is a large hybrid Transformer-Mamba Mixture of Experts model. It balances high performance and efficiency, offering substantial capabilities at lower inference costs.
arXiv Detail & Related papers (2025-05-21T12:11:53Z) - From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models [54.44375226381814]
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling. We introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach achieves state-of-the-art performance across a diverse set of long-context benchmarks.
arXiv Detail & Related papers (2025-04-08T16:58:58Z) - Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)