Multi-Head Low-Rank Attention
- URL: http://arxiv.org/abs/2603.02188v1
- Date: Mon, 02 Mar 2026 18:52:38 GMT
- Title: Multi-Head Low-Rank Attention
- Authors: Songtao Liu, Hongwu Peng, Zhiwei Zhang, Zhengyu Chen, Yue Guo,
- Abstract summary: Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size.<n>Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token.<n>We propose Multi-Head Low-Rank Attention (MLRA) which enables partitionable latent states for efficient 4-way TP decoding.
- Score: 22.28455391125486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-context inference in large language models is bottlenecked by Key--Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth Memory (HBM) to on-chip Static Random-Access Memory (SRAM) at each step. While Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP). Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token, consuming excessive memory traffic and diminishing TP benefits like weight sharding. In this work, we propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding. Extensive experiments show that MLRA achieves state-of-the-art perplexity and downstream task performance, while also delivering a 2.8$\times$ decoding speedup over MLA. Code is available at https://github.com/SongtaoLiu0823/MLRA. Pretrained weights, along with the training and evaluation data, are available at https://huggingface.co/Soughing/MLRA.
Related papers
- Joint Encoding of KV-Cache Blocks for Scalable LLM Serving [3.3230675313521716]
Existing KV-cache compression methods rely on rigids, disrupt tensor layouts, or require specialized compute.<n>We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations.<n>This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware.
arXiv Detail & Related papers (2026-01-06T14:50:58Z) - PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel [19.009329924868002]
LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory.<n>This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm.<n> PAT reduces attention latency by 67.4% on average and TPOT by 13.6-83.4% under the same configurations against state-of-the-art attention kernels.
arXiv Detail & Related papers (2025-11-27T11:10:30Z) - TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference [48.40143137402824]
Multi-Head Latent Attention (MLA) compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory.<n>In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache.<n>We propose-Parallel Latent Attention (TPLA), a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce.
arXiv Detail & Related papers (2025-08-21T15:25:40Z) - LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models.<n>LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z) - Hardware-Efficient Attention for Fast Decoding [13.958883001629644]
Grouped Latent Attention (GLA) is a parallel-friendly latent attention paired with low-level optimizations for fast decoding.<n>Our optimized GLA kernel is up to 2$times$ faster than FlashMLA, for example, in a speculative decoding setting.
arXiv Detail & Related papers (2025-05-27T17:54:07Z) - Multi-head Temporal Latent Attention [27.475917680869657]
Multi-head latent attention was recently developed to compress the Key-Value cache into a low-rank latent space.<n>This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension.<n>Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance.
arXiv Detail & Related papers (2025-05-19T02:09:41Z) - QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings.<n>In these scenarios, the Key-Value ( KV) cache is the primary bottleneck in terms of both GPU memory and latency.<n>We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $mathbf2.6times$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.