Fugu-MT paper translation (abstract): Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Paper overview: Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2510.20707v1
Date: Thu, 23 Oct 2025 16:17:47 GMT
Status: translation complete
System update date: 2025-10-25 03:08:18.331601
Title: Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Title (reference translation): Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Authors: Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang
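Abstract summary: MixKV is a novel method that mixes importance with diversity for optimized KV cache compression in vision-language models. Under extreme compression, MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks.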
Reference score (independently computed attention score): 14.603288559638614
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
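Abstract (reference translation): Recent large vision-language models (LVLMs) show remarkable ability in processing extended multi-modal sequences, but the resulting growth of the key-value (KV) cache creates a critical memory bottleneck that fundamentally limits deployment scalability. Existing KV cache compression methods focus on keeping high-importance KV pairs to minimize storage, yet they often overlook the modality-specific semantic redundancy patterns that appear prominently in multi-modal KV caches. In this study, we first show that the KV cache in LVLMs has different levels of redundancy across attention heads. We show that relying on importance alone covers only a subset of the KV cache information distribution, so semantic coverage may be lost. We therefore propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy and selectively balances diversity and importance when compressing KV pairs. Extensive experiments show that MixKV consistently improves existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves notable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, while maintaining comparable inference efficiency. In addition, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.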
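
To make the idea in the abstract concrete, the following is a minimal sketch of mixing an importance score with a diversity score when choosing which KV pairs to keep for one attention head. It is not the authors' implementation: the function names `minmax` and `select_kv`, the diversity score (dissimilarity to the head's mean key direction), the redundancy estimate (mean pairwise cosine similarity), and the linear mixing rule are all illustrative assumptions. See the linked repository for the actual method.

```python
import numpy as np

def minmax(x):
    """Rescale a 1-D array to [0, 1]; a constant input maps to all zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def select_kv(keys, importance, budget):
    """Pick up to `budget` KV indices for one head (assumes at least 2 keys).

    keys: (n, d) key vectors; importance: (n,) scores from any baseline method.
    The diversity and redundancy definitions below are assumptions made for
    illustration, not the formulation used in the paper.
    """
    n = len(keys)
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T  # pairwise cosine similarities
    # Hypothetical head redundancy: mean off-diagonal similarity, clipped to [0, 1].
    redundancy = float(np.clip((sim.sum() - np.trace(sim)) / (n * (n - 1)), 0.0, 1.0))
    # Hypothetical diversity: keys unlike the head's mean direction score higher.
    diversity = -(unit @ unit.mean(axis=0))
    # More redundant heads put more weight on diversity, less on importance.
    mixed = (1.0 - redundancy) * minmax(importance) + redundancy * minmax(diversity)
    keep = np.argsort(-mixed, kind="stable")[:budget]
    return np.sort(keep), redundancy

rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 8))
importance = rng.random(32)
keep, redundancy = select_kv(keys, importance, budget=8)
print(keep, round(redundancy, 3))
```

Under these assumptions, a head whose keys are nearly orthogonal falls back to plain importance ranking, while a head with highly similar keys leans on the diversity term.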

Related papers