Fugu-MT Paper Translation (Summary): Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Paper summary: Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

arxiv url: http://arxiv.org/abs/2602.01901v1
Date: Mon, 02 Feb 2026 10:08:00 GMT
Status: Translation complete
System update date: 2026-02-03 19:28:34.064171
Title: Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model
Title (reference translation): Q Cache: Visual attention is valuable in less than half of the decode layers of multimodal large language models
Authors: Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu
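Abstract summary: Multimodal large language models (MLLMs) suffer from exorbitant inference costs caused by the proliferation of visual tokens. Existing approaches focus on token-wise optimization, using various token pruning techniques to eliminate non-crucial visual tokens. The paper proposes Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns.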
Reference score (independently computed interest score): 21.206033754351786
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation into the attention mechanism of the model from a new perspective, and discern that attention within more than half of all decode layers is semantically similar. Based on this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It ingeniously reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible, as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve a 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.
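Abstract (reference translation): Multimodal large language models (MLLMs) suffer from exorbitant inference costs caused by the proliferation of visual tokens in the vision encoder. Redundant visual tokens impose a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, using a variety of intricate token pruning techniques to eliminate non-crucial visual tokens. However, these methods often unavoidably undermine the integrity of the KV cache, which leads to failures in long-text generation tasks. To address this, we investigate the model's attention mechanism in depth from a new perspective and find that attention in more than half of all decode layers is semantically similar. Based on this finding, we argue that attention in certain layers can be streamlined by inheriting it from preceding layers. We therefore propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns and reduces layer-wise redundant attention computation. Within Lazy Attention, we develop Q Cache, a novel layer-shared cache tailored for MLLMs that facilitates the reuse of queries across adjacent layers. Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. The method is also highly flexible: it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks show that it reduces KV cache usage by over 35% and improves throughput by 1.5x while sacrificing only about 1% of performance across various MLLMs. Compared with SOTA token-wise methods, it preserves accuracy better.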
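
The abstract's observation that attention is similar across many decode layers, so that some layers can inherit it from a preceding layer, can be illustrated with a toy sketch. This is not the paper's implementation: the function name `plan_lazy_layers`, the cosine-similarity criterion, the 0.95 threshold, and the toy logits below are all assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Turn raw attention logits into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def plan_lazy_layers(attn_per_layer, threshold=0.95):
    """Hypothetical helper: list layers that could reuse earlier attention.

    A layer is marked "lazy" when its attention distribution is close
    (cosine similarity >= threshold) to that of the most recent layer
    that was computed in full. The threshold is an assumed value.
    """
    lazy = []
    anchor = 0  # index of the last layer whose attention is computed in full
    for i in range(1, len(attn_per_layer)):
        if cosine(attn_per_layer[anchor], attn_per_layer[i]) >= threshold:
            lazy.append(i)
        else:
            anchor = i
    return lazy


# Toy logits: one query attending over four tokens in four layers (made up).
layer_logits = [
    [2.0, 0.5, 0.1, 0.1],
    [2.1, 0.4, 0.1, 0.2],
    [0.1, 0.2, 2.5, 0.3],
    [0.2, 0.1, 2.4, 0.3],
]
attn = [softmax(logits) for logits in layer_logits]
lazy = plan_lazy_layers(attn, threshold=0.95)
print(lazy)  # prints [1, 3]
```

In this toy setup, layers 1 and 3 closely match layers 0 and 2 respectively, so they are the candidates for inheriting attention instead of recomputing it. How the paper actually selects layers and shares queries through Q Cache is not specified in the abstract.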

Related papers