TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
- URL: http://arxiv.org/abs/2508.15881v2
- Date: Mon, 25 Aug 2025 02:24:20 GMT
- Title: TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
- Authors: Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang
- Abstract summary: Multi-Head Latent Attention (MLA) compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache. We propose Tensor-Parallel Latent Attention (TPLA), a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce.
- Score: 48.40143137402824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, eroding the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking TP efficiency. Unlike Grouped Latent Attention (GLA), every head in TPLA still leverages the full latent representation, maintaining stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms -- e.g., the Hadamard transform or PCA -- before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.
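A minimal single-process sketch of the scheme the abstract describes, with the tensor-parallel shards simulated by a loop and the all-reduce by a running sum. All shapes, the random orthogonal stand-in for the Hadamard/PCA transform, and the per-shard output projections are illustrative assumptions, not the paper's implementation.

```python
import torch

torch.manual_seed(0)
n_shards, seq_len, d_latent = 2, 16, 64
d_shard = d_latent // n_shards

c_kv = torch.randn(seq_len, d_latent)   # compressed latent KV cache (MLA-style)
q = torch.randn(1, d_latent)            # one decoding query, mapped to latent space

# Optional orthogonal mixing before TP slicing; the abstract names the
# Hadamard transform or PCA, a random orthogonal matrix stands in here.
mix, _ = torch.linalg.qr(torch.randn(d_latent, d_latent))
c_mixed, q_mixed = c_kv @ mix, q @ mix

# Per-shard slices of the output projection (illustrative shapes).
w_o = [torch.randn(d_shard, d_latent) / d_shard**0.5 for _ in range(n_shards)]

out = torch.zeros(1, d_latent)
for s in range(n_shards):               # each iteration plays one "device"
    sl = slice(s * d_shard, (s + 1) * d_shard)
    k_s = v_s = c_mixed[:, sl]          # this device's slice of the latent cache
    scores = q_mixed[:, sl] @ k_s.T / d_shard**0.5
    attn = torch.softmax(scores, dim=-1)  # softmax runs independently per shard
    out += (attn @ v_s) @ w_o[s]        # summation stands in for the all-reduce

print(out.shape)  # torch.Size([1, 64])
```

Because each shard runs its own softmax over only a slice of the latent dimensions, the combined output only approximates full MLA; this is the cross-shard interference that the orthogonal mixing is meant to spread evenly across shards.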
Related papers
- Multi-Head Low-Rank Attention [22.28455391125486]
Multi-Head Latent Attention (MLA) significantly reduces the total KV cache size.
Since its single latent head cannot be partitioned, each device is forced to redundantly load the complete KV cache for every token.
We propose Multi-Head Low-Rank Attention (MLRA), which enables partitionable latent states for efficient 4-way TP decoding.
arXiv Detail & Related papers (2026-03-02T18:52:38Z)
- DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations.
Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage.
Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
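The residual encoding can be pictured with a toy sketch: a new KV vector is matched to its nearest stored reference and only the (cheaply quantizable) difference is kept. The retrieval rule and the int8 residual format here are assumptions for illustration, not DeltaKV's actual design.

```python
import torch

torch.manual_seed(0)
d, n_refs = 64, 8
refs = torch.randn(n_refs, d)            # retrieved historical reference vectors

def encode(kv: torch.Tensor):
    # Pick the closest stored reference and keep only the residual.
    idx = torch.cdist(kv.unsqueeze(0), refs).argmin().item()
    residual = kv - refs[idx]
    # Residuals are small, so coarse quantization loses little fidelity.
    scale = residual.abs().max() / 127
    return idx, (residual / scale).round().to(torch.int8), scale

def decode(idx: int, q: torch.Tensor, scale: torch.Tensor):
    return refs[idx] + q.float() * scale

kv = refs[3] + 0.05 * torch.randn(d)     # a vector similar to a past one
err = (decode(*encode(kv)) - kv).abs().max()
print(f"max reconstruction error: {err:.4f}")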
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
- VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization [23.781285860723248]
The Key-Value (KV) cache introduces memory overhead during large language model (LLM) inference.
We propose VecInfer, a novel vector quantization (VQ) method for aggressive KV cache compression while enabling efficient inference.
VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks.
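A generic vector-quantization sketch of the kind of compression the abstract names: KV vectors are replaced by one-byte indices into a small codebook, after a per-channel rescaling that stands in (speculatively) for the paper's outlier suppression. Codebook construction by random sampling is a placeholder for proper k-means.

```python
import torch

torch.manual_seed(0)
n_vec, d, n_codes = 1024, 8, 256

kv = torch.randn(n_vec, d)
kv[:, 0] *= 10                           # give one channel large outliers

# Per-channel scaling tames the outlier channel before quantization; the
# actual suppression used by VecInfer may differ.
scale = kv.abs().amax(dim=0)
normed = kv / scale

# Toy codebook: sample existing vectors as centroids (k-means in practice).
codebook = normed[torch.randperm(n_vec)[:n_codes]]
codes = torch.cdist(normed, codebook).argmin(dim=1)   # one byte per vector

recon = codebook[codes] * scale
print(f"mean abs error: {(recon - kv).abs().mean():.3f}")
```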
arXiv Detail & Related papers (2025-10-07T17:35:28Z)
- Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space [12.98205656003145]
Multi-Head Attention's (MHA) quadratic compute and linearly growing KV cache make long-context transformers expensive to train and serve.
We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space.
Experiments show that CCGQA, the grouped-query variant of CCA, consistently outperforms both Grouped Query Attention (GQA) and Multi-Head Latent Attention (MLA) at equal KV-cache compression on dense and MoE models.
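A bare-bones sketch of attention computed entirely inside a shared low-dimensional latent space, as the abstract describes. The convolutional component implied by CCA's name is omitted, and all projections are random placeholders rather than trained weights.

```python
import torch

torch.manual_seed(0)
d_model, d_latent, seq_len = 128, 32, 16

x = torch.randn(seq_len, d_model)

# Down-projections into a shared latent space (CCA's convolutional mixing
# is not reproduced here).
w_q, w_k, w_v = (torch.randn(d_model, d_latent) / d_model**0.5 for _ in range(3))
w_o = torch.randn(d_latent, d_model) / d_latent**0.5

q, k, v = x @ w_q, x @ w_k, x @ w_v      # all three live in the latent space
attn = torch.softmax(q @ k.T / d_latent**0.5, dim=-1)
out = (attn @ v) @ w_o                   # up-project back to the model dim

# Only k and v (latent-sized) need caching: d_latent/d_model of MHA's cost.
print(out.shape, f"cache per token: {2 * d_latent} vs {2 * d_model} floats")
```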
arXiv Detail & Related papers (2025-10-06T04:24:23Z)
- ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache.
Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers.
We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
- TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization [21.229296254354878]
The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead.
Existing works mitigate this burden by offloading or compressing the KV cache.
We propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading.
arXiv Detail & Related papers (2025-05-26T07:00:04Z)
- Multi-head Temporal Latent Attention [14.410024368174872]
Multi-head latent attention was recently developed to compress the Key-Value cache into a low-rank latent space.
This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension.
Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance.
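The temporal reduction can be illustrated crudely: merge each window of adjacent latent cache entries into one. MTLA learns this merge; the plain stride-2 average below is only a stand-in to show where the saving comes from.

```python
import torch

torch.manual_seed(0)
seq_len, d_latent, stride = 16, 32, 2

latent_kv = torch.randn(seq_len, d_latent)   # MLA-style latent cache, one per token

# Temporal compression: collapse each window of `stride` adjacent latent
# vectors into one cache entry. A learned merge would replace this average
# in a real system; averaging is only a stand-in.
merged = latent_kv.view(seq_len // stride, stride, d_latent).mean(dim=1)

print(latent_kv.shape, "->", merged.shape)   # cache shrinks by `stride` in time
```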
arXiv Detail & Related papers (2025-05-19T02:09:41Z)
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [71.43026659686679]
Large Language Models (LLMs) have grown rapidly in size, creating challenges for efficient deployment on resource-constrained hardware.
We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
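The abstract does not spell out the mechanism, but lossless float compression of this kind typically exploits the skewed distribution of exponent bits. The toy below only estimates the achievable rate from exponent entropy on Gaussian weights; it is not DFloat11's actual coder.

```python
import math
from collections import Counter

import torch

torch.manual_seed(0)
# LLM weights are roughly Gaussian, so the 8 exponent bits of bfloat16 are
# highly skewed and entropy-code well. This toy only measures the achievable
# rate; it implements no actual encoder or decoder.
w = torch.randn(1 << 16).to(torch.bfloat16)
bits = w.view(torch.int16).to(torch.int32) & 0xFFFF  # raw 16-bit patterns
exponents = ((bits >> 7) & 0xFF).tolist()   # bfloat16: 1 sign, 8 exp, 7 mantissa

counts = Counter(exponents)
n = len(exponents)
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())

# sign (1 bit) + entropy-coded exponent + mantissa (7 bits), vs. 16 bits raw.
ratio = (1 + entropy + 7) / 16
print(f"exponent entropy {entropy:.2f} bits -> ~{ratio:.0%} of original size")
```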
arXiv Detail & Related papers (2025-04-15T22:38:38Z)
- Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers [55.87192133758051]
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency.
We propose DiffCR, a dynamic DiT inference framework with differentiable compression ratios.
arXiv Detail & Related papers (2024-12-22T02:04:17Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but it introduces substantial memory overhead.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
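A minimal sketch of the plug-in low-rank idea: factor a pretrained key projection with a truncated SVD so that only the rank-r activations need caching. LoRC's progressive, layer-wise compression strategy is not reproduced; the shapes and the random weight are illustrative (real pretrained weights have far more low-rank structure than Gaussian noise).

```python
import torch

torch.manual_seed(0)
d_model, d_head, rank = 128, 64, 16

w_k = torch.randn(d_model, d_head)       # stands in for a pretrained key projection

# Low-rank factorization of the weight itself (no retraining): keep the top
# singular directions so that K = (x @ A) @ B with a rank-`rank` bottleneck.
# Only the `rank`-dim activations x @ A need caching instead of full keys.
u, s, vh = torch.linalg.svd(w_k, full_matrices=False)
a = u[:, :rank] * s[:rank]               # (d_model, rank)
b = vh[:rank]                            # (rank, d_head)

x = torch.randn(10, d_model)
err = (x @ w_k - (x @ a) @ b).abs().mean()
print(f"cache {rank}/{d_head} of key size, mean abs err {err:.3f}")
```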
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low-precision KV cache quantization by incorporating several novel methods.
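One well-known ingredient in this line of work is quantizing keys per channel, since a few channels carry outsized magnitudes; the sketch below shows generic sub-4-bit per-channel quantization, not KVQuant's full recipe.

```python
import torch

torch.manual_seed(0)
seq_len, d_head, bits = 256, 64, 3

# Simulate keys whose channels have very different magnitudes (outliers).
k = torch.randn(seq_len, d_head) * torch.linspace(0.1, 4.0, d_head)

# Per-channel asymmetric quantization: scaling each channel separately
# preserves far more signal at sub-4-bit precision than one global scale.
levels = 2**bits - 1
lo, hi = k.amin(dim=0), k.amax(dim=0)
scale = (hi - lo).clamp(min=1e-8) / levels
q = ((k - lo) / scale).round().clamp(0, levels)
recon = q * scale + lo

print(f"{bits}-bit per-channel, mean abs err {(recon - k).abs().mean():.4f}")
```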
arXiv Detail & Related papers (2024-01-31T18:58:14Z)
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.