EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
- URL: http://arxiv.org/abs/2509.16686v1
- Date: Sat, 20 Sep 2025 13:27:13 GMT
- Title: EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
- Authors: Zhengge Cai, Haowen Hou
- Abstract summary: Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs). Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space. We propose Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness.
- Score: 8.093922145280326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose \textbf{Embedding-Gated Multi-head Latent Attention (EG-MLA)}, a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6\% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9\% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.
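To make the gating idea concrete, below is a minimal PyTorch sketch of an embedding-gated latent attention block: hidden states are down-projected into a shared latent (the only per-token state that would be cached), a token-specific sigmoid gate computed from the same embedding modulates that latent, and per-head keys and values are reconstructed by up-projection. The dimension sizes, the sigmoid gate, the single shared latent for both K and V, and all module names are illustrative assumptions rather than the paper's exact formulation (decoupled RoPE handling and the precise gating function, for example, are omitted).

```python
# Minimal sketch of embedding-gated latent attention, assuming an MLA-style
# shared latent KV projection plus a per-token gate derived from the input
# embedding. Dimensions and module layout are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingGatedLatentAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-project hidden states into a small shared latent; only this
        # (gated) latent of size d_latent per token would be cached.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Token-specific gate computed from the embedding, applied in latent space.
        self.gate = nn.Linear(d_model, d_latent)
        # Up-project the gated latent back to per-head keys and values.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, t, _ = x.shape
        latent = self.kv_down(x)                         # (b, t, d_latent)
        latent = latent * torch.sigmoid(self.gate(x))    # embedding gating
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, -1))

# Quick shape check.
x = torch.randn(2, 16, 512)
y = EmbeddingGatedLatentAttention()(x)                   # -> (2, 16, 512)
```

Under these assumptions, the memory saving comes from caching only the gated latent (d_latent floats per token instead of two full d_model-sized key and value tensors), while the gate itself adds just one small matrix multiply and an elementwise product per token.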
Related papers
- Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion [47.317377282106015]
We introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. We show that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
arXiv Detail & Related papers (2026-02-28T09:24:01Z) - Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z) - MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models [37.41464628858585]
We present MHA2MLA-VLM, a framework for converting off-the-shelf vision-language models to Multi-Head Latent Attention (MLA). We show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
arXiv Detail & Related papers (2026-01-16T17:45:34Z) - SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference [71.20542521694524]
SmallKV is a small model assisted compensation method for KV cache compression. We show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods.
arXiv Detail & Related papers (2025-08-03T09:15:36Z) - IAM: Efficient Inference through Attention Mapping between Different-scale LLMs [74.81417160018856]
The IAM framework achieves the dual benefits of accelerated attention computation and reduced KV cache usage. We show that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance.
arXiv Detail & Related papers (2025-07-16T06:39:11Z) - Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM [7.651654889371008]
Transformer-based models are the foundation of modern machine learning, but their execution places significant pressure on memory systems. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. Current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques.
arXiv Detail & Related papers (2025-05-09T04:17:05Z) - Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs [74.74225314708225]
Multi-head Latent Attention (MLA) is an innovative architecture designed to ensure efficient and economical inference. This paper proposes the first data-efficient fine-tuning method for transitioning from Multi-Head Attention to MLA.
arXiv Detail & Related papers (2025-02-20T18:50:42Z) - TransMLA: Multi-Head Latent Attention Is All You Need [34.38934956358534]
TransMLA is a framework that seamlessly converts GQA-based models to MLA-based models. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length.
arXiv Detail & Related papers (2025-02-11T18:20:18Z) - Multi-matrix Factorization Attention [59.10039136733939]
We propose Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads. MFA-KR further reduces memory requirements by repurposing the key cache as value.
arXiv Detail & Related papers (2024-12-26T15:45:45Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.