TransMLA: Multi-Head Latent Attention Is All You Need
- URL: http://arxiv.org/abs/2502.07864v2
- Date: Thu, 13 Feb 2025 18:07:04 GMT
- Title: TransMLA: Multi-Head Latent Attention Is All You Need
- Authors: Fanxu Meng, Zengwei Yao, Muhan Zhang
- Abstract summary: We introduce Multi-head Latent Attention (MLA) to solve communication bottlenecks in large language models. We show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. We plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models.
- Score: 22.354283924006786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.
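
The claim that GQA can always be represented by MLA at the same KV cache size can be illustrated with a small numerical check. The sketch below is not the paper's conversion procedure; it is a minimal pure-Python toy, assuming key projections only, tiny made-up dimensions, and a plain 0/1 replication matrix as the up-projection. All names (`repeat_kv_heads`, `replication_matrix`, `w_a`, `w_b`) are hypothetical and chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def repeat_kv_heads(k, n_kv, d_h, group):
    """GQA-style sharing: copy each cached key head `group` times."""
    out = []
    for row in k:
        new_row = []
        for h in range(n_kv):
            head = row[h * d_h:(h + 1) * d_h]
            new_row.extend(head * group)  # same head reused by `group` query heads
        out.append(new_row)
    return out


def replication_matrix(n_kv, d_h, group):
    """0/1 up-projection of shape (n_kv*d_h, n_kv*group*d_h) that mimics head copying."""
    rank = n_kv * d_h
    w_b = [[0.0] * (rank * group) for _ in range(rank)]
    for h in range(n_kv):
        for g in range(group):
            for i in range(d_h):
                w_b[h * d_h + i][(h * group + g) * d_h + i] = 1.0
    return w_b


# Toy sizes (assumptions for illustration only).
random.seed(0)
seq_len, d_model, n_q, n_kv, d_h = 3, 8, 4, 2, 2
group = n_q // n_kv

x = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
w_k = [[random.uniform(-1, 1) for _ in range(n_kv * d_h)] for _ in range(d_model)]

# GQA path: cache n_kv*d_h values per token, then copy heads at attention time.
k_cache_gqa = matmul(x, w_k)
k_gqa = repeat_kv_heads(k_cache_gqa, n_kv, d_h, group)

# MLA-style path: cache a latent of the same width, then apply an up-projection.
w_a = w_k                                   # down-projection to the latent
w_b = replication_matrix(n_kv, d_h, group)  # up-projection to all query heads
latent = matmul(x, w_a)
k_mla = matmul(latent, w_b)

max_diff = max(abs(a - b) for ra, rb in zip(k_gqa, k_mla) for a, b in zip(ra, rb))
print(len(latent[0]), len(k_mla[0]), max_diff)
```

In this toy setting both paths cache the same number of values per token (`n_kv * d_h`) and produce identical per-head keys. A general up-projection `w_b` is not restricted to this copy pattern, which is one way to read the abstract's statement that the converse does not hold.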
Related papers
- Boosting Large Language Models with Mask Fine-Tuning [60.56962908455601]
We introduce Mask Fine-Tuning (MFT) to show that properly breaking the integrity of the model can surprisingly lead to improved performance.
Experiments show that MFT gains a consistent performance boost across various domains and backbones.
arXiv Detail & Related papers (2025-03-27T20:17:57Z) - X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression [23.023849840907594]
Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression.
We show that our proposed method can effectively compress the KV cache while preserving the performance on the benchmarks.
arXiv Detail & Related papers (2025-03-14T06:49:37Z) - Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs [74.74225314708225]
Multi-head Latent Attention (MLA) is an innovative architecture designed to ensure efficient and economical inference.
This paper proposes the first data-efficient fine-tuning method for transitioning from Multi-Head Attention to MLA.
arXiv Detail & Related papers (2025-02-20T18:50:42Z) - CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs [45.77132019859689]
CalibQuant is a visual quantization strategy that drastically reduces both memory and computational overhead.
We achieve a 10x throughput increase on InternVL models.
arXiv Detail & Related papers (2025-02-15T05:08:01Z) - Enabling Autoregressive Models to Fill In Masked Tokens [50.9948753314669]
This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that achieves state-of-the-art masked infilling performance.
MARIA combines a pre-trained and AR model by training a linear decoder that takes their hidden states as input.
Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
arXiv Detail & Related papers (2025-02-09T20:02:05Z) - MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization [15.01214559812713]
MQuant is a post-training quantization framework designed to tackle the challenges of multimodal large language models (MLLMs).
On five mainstream MLLMs (including Qwen-VL, Mini-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (1% degradation) while reducing inference latency by up to 30%.
arXiv Detail & Related papers (2025-02-01T13:08:02Z) - Tensor Product Attention Is All You Need [53.69820973900921]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.
TPA achieves improved model quality alongside memory efficiency.
Based on TPA, we introduce the Tensor Product Attention Transformer (T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z) - Multi-matrix Factorization Attention [59.10039136733939]
We propose Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR).
MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads.
MFA-KR further reduces memory requirements by repurposing the key cache as value.
arXiv Detail & Related papers (2024-12-26T15:45:45Z) - Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks.
We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information.
It consistently achieves a significant (at least 70%) reduction in KV cache requirements while preserving most of the model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale [16.865532646589987]
This paper investigates the pretraining of low-bitwidth models, specifically Ternary Language Models (TriLMs), as an alternative to traditional floating-point models (FloatLMs) and their post-training quantized versions (QuantLMs).
We present the Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens.
arXiv Detail & Related papers (2024-07-17T05:53:20Z) - LCM: Locally Constrained Compact Point Cloud Model for Masked Point Modeling [47.94285833315427]
We propose a Locally constrained Compact point cloud Model (LCM) consisting of a locally constrained compact encoder and a locally constrained Mamba-based decoder.
Our encoder replaces self-attention with our local aggregation layers to achieve an elegant balance between performance and efficiency.
This decoder ensures linear complexity while maximizing the perception of point cloud geometry information from unmasked patches with higher information density.
arXiv Detail & Related papers (2024-05-27T13:19:23Z) - Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
For the first time, it achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z) - Large Product Key Memory for Pretrained Language Models [12.932177565788974]
Product key memory (PKM) improves prediction accuracy by increasing model capacity efficiently with insignificant computational overhead.
Motivated by the recent success of pretrained language models (PLMs), we investigate how to incorporate large PKM into PLMs that can be fine-tuned for a wide variety of downstream NLP tasks.
arXiv Detail & Related papers (2020-10-08T10:19:50Z)