Multi-head Temporal Latent Attention
- URL: http://arxiv.org/abs/2505.13544v3
- Date: Sun, 02 Nov 2025 20:27:27 GMT
- Title: Multi-head Temporal Latent Attention
- Authors: Keqi Deng, Philip C. Woodland
- Abstract summary: Multi-head latent attention was recently developed to compress the Key-Value (KV) cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance.
- Score: 27.475917680869657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on an English-German speech translation task, MTLA achieves a 5.3x speedup and an 8.3x reduction in GPU memory usage compared to MHA, while maintaining translation quality.
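The two mechanisms named in the abstract, temporal merging of adjacent KV cache vectors and a stride-aware causal mask, are easy to illustrate. The Python sketch below is a simplified reading of the abstract, not the authors' implementation: the hyper-network is replaced by plain averaging, and the names `stride`, `merge_kv` and `stride_aware_mask` are illustrative.

```python
import torch

def merge_kv(kv: torch.Tensor, stride: int) -> torch.Tensor:
    """Compress a (T, d) KV cache along time. MTLA derives merge weights
    from a hyper-network; uniform averaging stands in for it here."""
    T, d = kv.shape
    pad = (-T) % stride
    if pad:
        kv = torch.cat([kv, kv.new_zeros(pad, d)])  # pad to a multiple of stride
    return kv.view(-1, stride, d).mean(dim=1)       # (ceil(T / stride), d)

def stride_aware_mask(T: int, stride: int) -> torch.Tensor:
    """Boolean (T, ceil(T/stride)) mask for parallel training: query t may
    attend to compressed slot j only if j is t's own (possibly partial)
    merge window or an earlier one, matching incremental inference."""
    t = torch.arange(T).unsqueeze(1)                            # (T, 1)
    j = torch.arange((T + stride - 1) // stride).unsqueeze(0)   # (1, S)
    return j <= t // stride

kv = torch.randn(7, 4)
print(merge_kv(kv, stride=2).shape)    # torch.Size([4, 4])
print(stride_aware_mask(7, stride=2))  # staircase pattern widening every 2 steps
```

With stride 2, a 7-step cache shrinks to 4 slots, and query 5 attends to slots 0 through 2, the same state an incremental decoder would have built at that step.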
Related papers
- Multi-Head Low-Rank Attention [22.28455391125486]
Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size. Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token. We propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding.
arXiv Detail & Related papers (2026-03-02T18:52:38Z) - DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
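The core idea, storing a small residual against a similar historical entry rather than a whole vector, can be sketched in a few lines. Everything below (the `add_entry`/`reconstruct` names, the int8 residual quantisation, the reference pool) is an assumption made for illustration, not DeltaKV's actual pipeline.

```python
import torch

def add_entry(kv_new: torch.Tensor, refs: torch.Tensor):
    """Encode a new KV vector as (reference index, quantised residual).
    If long-range similarity holds, the residual has a small range and
    survives aggressive quantisation (int8 here) almost losslessly."""
    idx = torch.cdist(kv_new[None], refs)[0].argmin()
    residual = kv_new - refs[idx]
    scale = residual.abs().max().clamp_min(1e-8) / 127.0
    q = torch.round(residual / scale).to(torch.int8)
    return idx.item(), q, scale.item()

def reconstruct(idx: int, q: torch.Tensor, scale: float, refs: torch.Tensor):
    return refs[idx] + q.to(torch.float32) * scale

refs = torch.randn(16, 64)             # retained reference vectors
kv = refs[3] + 0.01 * torch.randn(64)  # new vector similar to reference 3
idx, q, scale = add_entry(kv, refs)
err = (reconstruct(idx, q, scale, refs) - kv).abs().max()
print(idx, float(err))                 # idx == 3, error on the order of 1e-4
```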
arXiv Detail & Related papers (2026-02-08T15:14:36Z) - TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference [48.40143137402824]
Multi-Head Latent Attention (MLA) compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache. We propose Tensor-Parallel Latent Attention (TPLA), a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce.
arXiv Detail & Related papers (2025-08-21T15:25:40Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor to autoregressive language models. We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z) - SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching [9.617322424513317]
SentenceKV is a novel KV caching approach designed to enhance inference efficiency while preserving semantic coherence. We show that SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage, without compromising model accuracy.
arXiv Detail & Related papers (2025-04-01T17:08:57Z) - Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs [6.222287867011644]
We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. Unlike retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. Our studies show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works.
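A constant-sized cache guided by recent attention can be sketched as follows. This is a generic illustration of the idea in the abstract, not MorphKV's algorithm: the window size, the scoring rule, and the `refresh_cache` name are all assumptions.

```python
import torch

def refresh_cache(keys, values, recent_q, budget: int, n_recent: int = 4):
    """Keep the KV cache at a fixed budget: always retain the most recent
    n_recent entries, and fill the rest with the entries that received the
    highest attention from recent queries (a stand-in scoring rule)."""
    T = keys.shape[0]
    if T <= budget:
        return keys, values
    scores = (recent_q @ keys.T).softmax(dim=-1).mean(dim=0)  # (T,)
    scores[-n_recent:] = float("inf")           # recent tokens always stay
    keep = scores.topk(budget).indices.sort().values  # restore temporal order
    return keys[keep], values[keep]

keys, values = torch.randn(32, 8), torch.randn(32, 8)
recent_q = torch.randn(4, 8)                    # queries of recent tokens
k2, v2 = refresh_cache(keys, values, recent_q, budget=16)
print(k2.shape)                                 # torch.Size([16, 8])
```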
arXiv Detail & Related papers (2025-03-02T18:12:50Z) - MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference [15.895020720304656]
MEDA is a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed.
arXiv Detail & Related papers (2025-02-24T19:34:52Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point the pruning process halts. Our method is easy to integrate into LLM inference, not only optimizing memory space but also reducing inference time compared to existing methods.
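The distinctive part here is the stopping rule: prune until a signal says further pruning would hurt, so the budget is found dynamically rather than fixed in advance. The sketch below uses retained attention mass as a stand-in metric; the abstract does not specify DBudgetKV's actual metric or threshold, so both are assumptions.

```python
import torch

def prune_with_dynamic_budget(keys, q, threshold: float = 0.95):
    """Drop the lowest-attention entries one at a time, halting once the
    retained entries no longer cover `threshold` of the full cache's
    attention mass (an assumed stand-in for DBudgetKV's metric)."""
    attn = (q @ keys.T).softmax(dim=-1).mean(dim=0)  # (T,) importance scores
    order = attn.argsort(descending=True)            # most important first
    mass = attn[order].cumsum(dim=0)
    n_keep = int((mass < threshold).sum().item()) + 1  # budget found dynamically
    return order[:n_keep].sort().values              # indices to retain, in order

keys = torch.randn(64, 8)
q = torch.randn(4, 8)
keep = prune_with_dynamic_budget(keys, q)
print(len(keep), "of", keys.shape[0], "entries retained")
```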
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts. ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units. ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z) - MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference [5.1206021159434805]
MPCache is built on the observation that historical tokens in a long sequence may have different effects on downstream decoding. MPCache consistently outperforms prior-art KV cache eviction baselines across different LLM generation tasks.
arXiv Detail & Related papers (2025-01-12T13:18:04Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
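The plug-in, no-retraining flavour of this is a truncated SVD of the pretrained key/value projection matrices. The sketch below shows that single step with an arbitrary rank; it ignores LoRC's progressive, layer-wise compression strategy, and the shapes and rank are illustrative.

```python
import torch

def low_rank_factor(W: torch.Tensor, rank: int):
    """Factor a pretrained projection W (d_model, d_head) into A @ B with
    inner dimension `rank`. Caching the A-projected states then shrinks
    the per-token KV cache from d_head to rank entries."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_model, rank), best rank-r factor
    B = Vh[:rank]                # (rank, d_head); recover K = (X @ A) @ B
    return A, B

W_k = torch.randn(512, 64)       # stand-in for a pretrained key projection
A, B = low_rank_factor(W_k, rank=16)
err = (A @ B - W_k).norm() / W_k.norm()
print(A.shape, B.shape, f"relative error {err.item():.2f}")
```

On a real pretrained weight matrix the singular values decay much faster than for this random stand-in, which is what makes the truncation cheap in practice.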
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
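Query-dependent channel pruning of the key cache can be illustrated as below. The channel score (query-weighted channel magnitude) and the keep ratio are assumptions for the sketch, not ThinK's exact criterion.

```python
import torch

def prune_key_channels(keys, q, keep_ratio: float = 0.5):
    """Keep only the key-cache channels that contribute most to the
    attention logits, scored here by |q_c| * ||K[:, c]|| per channel c.
    Queries must be sliced to the same channels before computing logits."""
    score = q.abs().mean(dim=0) * keys.norm(dim=0)  # (d,) channel scores
    d_keep = max(1, int(keys.shape[1] * keep_ratio))
    channels = score.topk(d_keep).indices.sort().values
    return keys[:, channels], channels

keys = torch.randn(128, 64)   # (cached tokens, channels)
q = torch.randn(4, 64)        # recent queries
k_pruned, channels = prune_key_channels(keys, q)
logits = q[:, channels] @ k_pruned.T   # attention logits on pruned channels
print(k_pruned.shape)         # torch.Size([128, 32])
```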
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference [32.20654044142376]
LOOK-M is a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size.
It achieves up to 1.5x faster decoding and also maintains or even enhances performance across a variety of long context multimodal tasks.
arXiv Detail & Related papers (2024-06-26T07:44:24Z) - MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
The Key-Value (KV) cache stores key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
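Compressing along depth means merging the KV caches of neighbouring layers, which the following sketch shows in its simplest form. Plain averaging of layer pairs is only a stand-in for MiniCache's actual merge function, and the shapes are illustrative.

```python
import torch

def merge_layers(kv_per_layer: torch.Tensor) -> torch.Tensor:
    """Merge the KV caches of adjacent layer pairs, halving the depth
    dimension. Averaging stands in for a real merge function; it only
    pays off when neighbouring layers hold similar KV states."""
    L, T, d = kv_per_layer.shape
    assert L % 2 == 0, "expects an even number of layers to pair up"
    return kv_per_layer.view(L // 2, 2, T, d).mean(dim=1)  # (L/2, T, d)

kv = torch.randn(24, 128, 64)   # (layers, cached tokens, head_dim)
print(merge_layers(kv).shape)   # torch.Size([12, 128, 64])
```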
arXiv Detail & Related papers (2024-05-23T09:43:52Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)