MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
- URL: http://arxiv.org/abs/2601.11464v1
- Date: Fri, 16 Jan 2026 17:45:34 GMT
- Title: MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
- Authors: Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui
- Abstract summary: We present MHA2MLA-VLM, a framework for converting off-the-shelf vision-language models to Multi-Head Latent Attention (MLA). We show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
- Score: 37.41464628858585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
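As a rough illustration of technique (2), the sketch below compresses a KV projection weight with a truncated SVD, fitting a separate factorization (and rank) for visual and textual tokens so that only a small latent vector is cached per token. This is a minimal reading of "modality-decoupled low-rank approximation" under assumed shapes; the function and variable names are hypothetical and do not come from the paper.

```python
import torch

def low_rank_factorize(w_kv: torch.Tensor, rank: int):
    """Factor a KV projection weight (d_out x d_in) via truncated SVD so
    that only a rank-`rank` latent needs to be cached per token."""
    u, s, vh = torch.linalg.svd(w_kv, full_matrices=False)
    w_down = vh[:rank] * s[:rank].unsqueeze(-1)  # (rank, d_in): input -> latent
    w_up = u[:, :rank]                           # (d_out, rank): latent -> K/V
    return w_down, w_up

# Decoupled compression: a separate factorization (and rank) per modality,
# mirroring the idea of compressing visual and textual KV spaces
# independently. A single toy weight stands in for both modalities here.
w_k = torch.randn(1024, 2048)                    # toy projection weight
down_vis, up_vis = low_rank_factorize(w_k, rank=128)
down_txt, up_txt = low_rank_factorize(w_k, rank=256)

x_vis = torch.randn(10, 2048)                    # 10 visual tokens
latent = x_vis @ down_vis.T                      # cache only this (10, 128) latent
k_vis = latent @ up_vis.T                        # reconstruct keys when needed
```

Caching `latent` instead of full keys and values is what shrinks the KV footprint; as in MLA, the up-projection can typically be folded into the attention computation rather than materialized.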
Related papers
- Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
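For context, the eviction step this entry describes can be sketched as follows: keep the top-scoring fraction of cached entries and drop the rest. How the gate scores are produced is the paper's contribution and is not reproduced here; the names below are illustrative only.

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             gate_scores: torch.Tensor, keep_ratio: float = 0.3):
    """Drop low-importance entries from a per-head KV cache.

    keys/values: (seq_len, head_dim); gate_scores: (seq_len,) importance
    signal (learned by a gating module in the paper; arbitrary here).
    Keeping 30% of entries corresponds to the ~70% eviction reported.
    """
    keep = max(1, int(keys.shape[0] * keep_ratio))
    idx = gate_scores.topk(keep).indices.sort().values  # preserve token order
    return keys[idx], values[idx]

# Illustrative usage with random tensors.
k, v = torch.randn(1000, 64), torch.randn(1000, 64)
scores = torch.rand(1000)
k_small, v_small = evict_kv(k, v, scores)  # 300 of 1000 entries survive
```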
- EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs [8.093922145280326]
Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs). Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space. We propose Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness.
arXiv Detail & Related papers (2025-09-20T13:27:13Z)
- SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference [71.20542521694524]
SmallKV is a small-model-assisted compensation method for KV cache compression. We show that SmallKV achieves 1.75-2.56x higher throughput than baseline methods.
arXiv Detail & Related papers (2025-08-03T09:15:36Z)
- Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study [64.26593350748401]
Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities. Current parameter-reduction techniques primarily involve training MLLMs from Small Language Models (SLMs). We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training.
arXiv Detail & Related papers (2025-07-28T11:57:52Z)
- IAM: Efficient Inference through Attention Mapping between Different-scale LLMs [74.81417160018856]
The IAM framework achieves the dual benefits of accelerated attention computation and reduced KV cache usage. We show that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance.
arXiv Detail & Related papers (2025-07-16T06:39:11Z)
- MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference [13.069489189643441]
MadaKV is a modality-adaptive key-value cache eviction strategy for long-context inference. It achieves substantial reductions in KV cache memory footprint and decoding latency. Experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV.
arXiv Detail & Related papers (2025-06-06T01:51:24Z)
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs [92.7279890407059]
Multi-head Latent Attention (MLA) is an innovative architecture designed to ensure efficient and economical inference. This paper proposes the first data-efficient fine-tuning method for transitioning from Multi-Head Attention (MHA) to MLA.
arXiv Detail & Related papers (2025-02-20T18:50:42Z)
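This entry is the text-only precursor of MHA2MLA-VLM, whose abstract above also notes that minimizing output activation error works better than minimizing parameter distance during conversion. A bare-bones version of such an objective might look like the sketch below; the per-layer plain MSE and the function name are assumptions, not either paper's exact loss.

```python
import torch
import torch.nn.functional as F

def activation_matching_loss(mla_layer, mha_layer, hidden_states):
    """Penalize the converted (MLA) layer for deviating from the original
    (MHA) layer's *outputs* on real activations, instead of matching the
    two layers' weights directly."""
    with torch.no_grad():
        target = mha_layer(hidden_states)   # frozen teacher outputs
    pred = mla_layer(hidden_states)         # trainable student outputs
    return F.mse_loss(pred, target)
```

In practice such a loss would be applied layer by layer during the parameter-efficient fine-tuning stage, with the original MHA layers serving as frozen teachers.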
- LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference [32.20654044142376]
LOOK-M is a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size. It achieves up to 1.5x faster decoding while maintaining or even enhancing performance across a variety of long-context multimodal tasks.
arXiv Detail & Related papers (2024-06-26T07:44:24Z)