Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
- URL: http://arxiv.org/abs/2603.00563v1
- Date: Sat, 28 Feb 2026 09:24:01 GMT
- Title: Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
- Authors: Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si
- Abstract summary: We introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. We show that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
- Score: 47.317377282106015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is problematic for many applications, especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
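The memory arithmetic behind the reported reduction can be illustrated with a minimal sketch. Standard MHA caches full per-head K and V vectors for every generated token, while MLA caches a single shared latent vector per token and reconstructs K and V from it with learned up-projections at attention time. The head count, head dimension, and latent dimension below are illustrative assumptions, not values taken from the paper; they are chosen only so that the ratio matches the claimed 87.5% reduction.

```python
# Sketch: per-token KV cache footprint of MHA vs. an MLA-style latent cache
# for one decoder layer. All dimensions are hypothetical examples.

def mha_kv_cache_per_token(n_heads: int, head_dim: int) -> int:
    """MHA stores both K and V vectors for every attention head."""
    return 2 * n_heads * head_dim

def mla_cache_per_token(latent_dim: int) -> int:
    """MLA stores one shared latent vector; K and V are recovered
    from it via up-projection matrices during attention."""
    return latent_dim

if __name__ == "__main__":
    n_heads, head_dim = 16, 64   # hypothetical 1024-dim decoder
    latent_dim = 256             # hypothetical compression target
    mha = mha_kv_cache_per_token(n_heads, head_dim)  # 2048 values/token
    mla = mla_cache_per_token(latent_dim)            # 256 values/token
    print(f"cache reduction: {1 - mla / mha:.1%}")   # prints "cache reduction: 87.5%"
```

With these example sizes, the latent cache holds 256 values per token instead of 2048, i.e. a 1/8 footprint; the actual dimensions in Whisper-MLA may differ.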
Related papers
- MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition [39.90876258237132]
Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities. MoME is a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based large language models for speech recognition. MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters.
arXiv Detail & Related papers (2025-10-05T10:34:34Z)
- Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models [60.857389526958485]
MATA is a training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains.
arXiv Detail & Related papers (2025-09-23T09:02:15Z)
- EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs [8.093922145280326]
Reducing Key-Value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs). Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space. We propose Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness.
arXiv Detail & Related papers (2025-09-20T13:27:13Z)
- Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing [33.36615989947073]
We present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower.
arXiv Detail & Related papers (2025-09-20T10:48:06Z)
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
- X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression [30.770661469301544]
Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. We show that our proposed method can effectively compress the KV cache while preserving performance on the benchmarks.
arXiv Detail & Related papers (2025-03-14T06:49:37Z)
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs [92.7279890407059]
Multi-head Latent Attention (MLA) is an innovative architecture designed to ensure efficient and economical inference. This paper proposes the first data-efficient fine-tuning method for transitioning from Multi-Head Attention to MLA.
arXiv Detail & Related papers (2025-02-20T18:50:42Z)
- TransMLA: Multi-Head Latent Attention Is All You Need [34.38934956358534]
TransMLA is a framework that seamlessly converts GQA-based models to MLA-based models. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length.
arXiv Detail & Related papers (2025-02-11T18:20:18Z)
- Multi-matrix Factorization Attention [59.10039136733939]
We propose Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads. MFA-KR further reduces memory requirements by repurposing the key cache as value.
arXiv Detail & Related papers (2024-12-26T15:45:45Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.