Fugu-MT Paper Translation (Abstract): Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Paper overview: Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

arxiv url: http://arxiv.org/abs/2605.19218v1
Date: Tue, 19 May 2026 00:45:00 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.043749
Title: Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference
Title (reference translation): Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference
Authors: Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim
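Abstract summary: Vision-Language Models suffer severe KV cache pressure at inference because a single image is encoded into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation. The authors develop RotateK, a rotation-based structured Key channel pruning framework.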
Reference score (independently computed attention score): 12.99113243259336
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
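The fixed-budget argument in the abstract can be made concrete with a little arithmetic. The sketch below is an illustration only, not taken from the paper: it assumes a single attention head in a single layer, fp16 storage (2 bytes per element), Values left unpruned, and no overhead for masks or rotation matrices. The function name `tokens_within_budget` and all numbers are hypothetical.

```python
def tokens_within_budget(budget_bytes, head_dim, key_keep_ratio=1.0, bytes_per_elem=2):
    """Tokens whose Keys and Values fit in `budget_bytes` (one head, one layer).

    Only the Key channels are pruned; Values keep all `head_dim` channels.
    """
    key_bytes = int(head_dim * key_keep_ratio) * bytes_per_elem
    value_bytes = head_dim * bytes_per_elem
    return budget_bytes // (key_bytes + value_bytes)


head_dim = 128
# Budget sized to hold exactly 1024 tokens with full Keys and Values in fp16.
budget = 1024 * 2 * head_dim * 2

full = tokens_within_budget(budget, head_dim)                             # 1024
half_keys = tokens_within_budget(budget, head_dim, key_keep_ratio=0.5)    # 1365
```

Under these assumptions, keeping half of the Key channels lets about a third more tokens fit in the same memory, which is the sense in which channel pruning complements token pruning.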
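Abstract (reference translation): Vision-Language Models suffer severe KV cache pressure at inference because a single image is encoded into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content substantially degrades fine-grained perception tasks. Under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. RotateK, a rotation-based structured Key channel pruning framework, resolves this trade-off. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.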
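The rotate-then-prune idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it uses an offline eigendecomposition instead of the online PCA the abstract mentions, synthetic low-rank Keys, and hypothetical helper names (`pca_rotation`, `headwise_mask`, `pruned_logits`). It also ignores positional encodings and the fused Triton kernel.

```python
import numpy as np


def pca_rotation(keys):
    """Orthogonal matrix whose columns are principal directions of `keys`.

    keys: array of shape (num_tokens, head_dim) for one attention head.
    """
    # Uncentered second-moment matrix of the Keys.
    second_moment = keys.T @ keys
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1]]      # reorder to descending


def headwise_mask(keys, keep):
    """Indices of the `keep` channels with the most energy over all tokens."""
    energy = (keys ** 2).sum(axis=0)
    return np.sort(np.argsort(energy)[::-1][:keep])


def pruned_logits(keys, query, rotation, keep):
    """Attention logits using only `keep` channels, shared by every token."""
    k_rot = keys @ rotation    # rotate Keys
    q_rot = query @ rotation   # rotate the Query the same way
    idx = headwise_mask(k_rot, keep)
    return k_rot[:, idx] @ q_rot[idx]


rng = np.random.default_rng(0)
num_tokens, head_dim, keep = 512, 64, 16

# Synthetic Keys: rank-8 structure plus small noise (an assumption).
basis = rng.standard_normal((8, head_dim))
keys = rng.standard_normal((num_tokens, 8)) @ basis
keys += 0.05 * rng.standard_normal((num_tokens, head_dim))
query = rng.standard_normal(head_dim)

exact = keys @ query
R = pca_rotation(keys)

# Worst-case logit error with and without the rotation, same channel budget.
rotated_err = np.abs(pruned_logits(keys, query, R, keep) - exact).max()
plain_err = np.abs(pruned_logits(keys, query, np.eye(head_dim), keep) - exact).max()
```

Because the rotation is orthogonal, rotating both Keys and Queries leaves the full logits unchanged; the benefit appears only once channels are dropped. On this synthetic data the rotated Keys concentrate their energy in a few channels, so one mask shared by all tokens loses far less than the same mask applied to unrotated Keys.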
