Kimi Linear: An Expressive, Efficient Attention Architecture
- URL: http://arxiv.org/abs/2510.26692v2
- Date: Sat, 01 Nov 2025 12:05:18 GMT
- Title: Kimi Linear: An Expressive, Efficient Attention Architecture
- Authors: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du,
- Abstract summary: Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons.<n>At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet.<n>We show that Kimi Linear can be a drop-in replacement for full attention with superior performance and efficiency.
- Score: 75.89211364086309
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
Related papers
- MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention.<n>On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z) - Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction [3.9660062354591754]
Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity limits practical deployment.<n> Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation.<n>We introduce a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task.<n>This yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
arXiv Detail & Related papers (2026-01-16T02:01:40Z) - Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference.<n>TCA-Attention achieves a 2.8$times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z) - DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference [19.987309147268586]
We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference on resource-constrained edge devices.<n>We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T model and Llama3.2-1B-Instruct model across diverse language tasks.
arXiv Detail & Related papers (2025-07-25T18:23:18Z) - IAM: Efficient Inference through Attention Mapping between Different-scale LLMs [74.81417160018856]
IAM framework achieves dual benefits of accelerated attention computation and reduced KV cache usage.<n>We show that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance.
arXiv Detail & Related papers (2025-07-16T06:39:11Z) - Comba: Improving Bilinear RNNs with Closed-loop Control [57.800320390698516]
We introduce the concept of Bilinear RNNs with a comprehensive analysis on the advantages and limitations of these models.<n>We propose a novel Bilinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections.<n>We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on large-scale corpus.
arXiv Detail & Related papers (2025-06-03T05:44:50Z) - Enhancing Reinforcement Learning for the Floorplanning of Analog ICs with Beam Search [0.32985979395737786]
This paper presents a hybrid method that combines reinforcement learning (RL) with a beam (BS) strategy.<n>The BS algorithm enhances the agent's inference process, allowing for the generation of flexible floorplans.<n> Experimental results show approx. 5-85% improvement in area, dead space and half-perimeter wire length compared to a standard RL application.
arXiv Detail & Related papers (2025-05-08T08:50:32Z) - Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient solution by reducing the complexity to linear levels.<n>Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map.<n>We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
arXiv Detail & Related papers (2024-11-12T08:30:59Z) - Optimal Parallelization Strategies for Active Flow Control in Deep Reinforcement Learning-Based Computational Fluid Dynamics [29.49913315698914]
Deep Reinforcement Learning (DRL) has emerged as a promising approach for handling highly dynamic and nonlinear Active Flow Control (AFC) problems.
This study focuses on optimizing DRL-based algorithms in parallel settings.
We achieve a significant boost in parallel efficiency from around 49% to approximately 78%.
arXiv Detail & Related papers (2024-02-18T09:07:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.