Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
- URL: http://arxiv.org/abs/2603.00563v1
- Date: Sat, 28 Feb 2026 09:24:01 GMT
- Title: Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
- Authors: Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si
- Abstract summary: We introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. We show that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
- Score: 47.317377282106015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is problematic for many applications, especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
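The memory arithmetic behind the reported reduction can be illustrated with a minimal sketch. Standard MHA caches full per-head K and V vectors for every generated token, while MLA caches a single shared latent vector per token and reconstructs K and V from it with learned up-projections at attention time. The head count, head dimension, and latent dimension below are illustrative assumptions, not values taken from the paper; they are chosen only so that the ratio matches the claimed 87.5% reduction.

```python
# Sketch: per-token KV cache footprint of MHA vs. an MLA-style latent cache
# for one decoder layer. All dimensions are hypothetical examples.

def mha_kv_cache_per_token(n_heads: int, head_dim: int) -> int:
    """MHA stores both K and V vectors for every attention head."""
    return 2 * n_heads * head_dim

def mla_cache_per_token(latent_dim: int) -> int:
    """MLA stores one shared latent vector; K and V are recovered
    from it via up-projection matrices during attention."""
    return latent_dim

if __name__ == "__main__":
    n_heads, head_dim = 16, 64   # hypothetical 1024-dim decoder
    latent_dim = 256             # hypothetical compression target
    mha = mha_kv_cache_per_token(n_heads, head_dim)  # 2048 values/token
    mla = mla_cache_per_token(latent_dim)            # 256 values/token
    print(f"cache reduction: {1 - mla / mha:.1%}")   # prints "cache reduction: 87.5%"
```

With these example sizes, the latent cache holds 256 values per token instead of 2048, i.e. a 1/8 footprint; the actual dimensions in Whisper-MLA may differ.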
Related papers
- MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition [39.90876258237132]
Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities. MoME is a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based large language models for speech recognition. MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters.
arXiv Detail & Related papers (2025-10-05T10:34:34Z)
- Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models [60.857389526958485]
MATA is a training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains.
arXiv Detail & Related papers (2025-09-23T09:02:15Z)
- EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs [8.093922145280326]
Reducing Key-Value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs). Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space. We propose Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness.
arXiv Detail & Related papers (2025-09-20T13:27:13Z)
- Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing [33.36615989947073]
We present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower.
arXiv Detail & Related papers (2025-09-20T10:48:06Z)
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
- X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression [30.770661469301544]
Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. We show that our proposed method can effectively compress the KV cache while preserving performance on the benchmarks.
arXiv Detail & Related papers (2025-03-14T06:49:37Z)
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs [92.7279890407059]
Multi-head Latent Attention (MLA) is an innovative architecture designed to ensure efficient and economical inference. This paper proposes the first data-efficient fine-tuning method for transitioning from Multi-Head Attention to MLA.
arXiv Detail & Related papers (2025-02-20T18:50:42Z)
- TransMLA: Multi-Head Latent Attention Is All You Need [34.38934956358534]
TransMLA is a framework that seamlessly converts GQA-based models to MLA-based models. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length.
arXiv Detail & Related papers (2025-02-11T18:20:18Z)
- Multi-matrix Factorization Attention [59.10039136733939]
We propose Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads. MFA-KR further reduces memory requirements by repurposing the key cache as value.
arXiv Detail & Related papers (2024-12-26T15:45:45Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.