Fugu-MT Paper Translation (Summary): CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models

Paper summary: CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2602.00247v1
Date: Fri, 30 Jan 2026 19:09:03 GMT
Status: Translation complete
System update date: 2026-02-03 19:28:33.075225
Title: CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models
Title (reference translation): CAPA: Contribution-Aware Pruning and FFN Approximation for Highly Efficient Vision-Language Models
Authors: Samyak Jha, Junho Kim
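Abstract summary: This work shows that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. It introduces CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions.
Reference score (independently computed notability): 14.30682201364961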
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient inference in Large Vision-Language Models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. While attention scores are commonly used to estimate visual token importance, they are an imperfect proxy for actual contribution. We show that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Our empirical analysis reveals that visual attention sinks are functionally heterogeneous, comprising Probability Dumps with low contribution that can be safely pruned, and Structural Anchors with high contribution essential for maintaining model performance. Further, we identify substantial redundancy in Feed-Forward Networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. Based on our findings, we introduce CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions and reduces FFN computation through efficient linear approximations. Experiments on various benchmarks across baselines show that CAPA achieves competent efficiency-performance trade-offs with improved robustness.
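Abstract (reference translation): Efficient inference in large vision-language models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. Attention scores are commonly used to estimate visual token importance, but they are an imperfect proxy for actual contribution. This work shows that attention contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Empirical analysis reveals that visual attention sinks are functionally heterogeneous, consisting of probability dumps, which have low contribution and can be safely pruned, and structural anchors, which have high contribution and are essential for maintaining model performance. The work further identifies substantial redundancy in the feed-forward networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. It proposes CAPA (Contribution-Aware Pruning and FFN Approximation), which combines pruning of visual tokens using attention contribution at critical functional transitions with reduction of FFN computation through efficient linear approximations. Experiments on various benchmarks across baselines show that CAPA achieves competent efficiency-performance trade-offs with improved robustness.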
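
The Attention Contribution criterion described in the abstract can be illustrated with a minimal sketch. The abstract states only that attention probabilities are weighted by value vector magnitude. The L2 norm, the single attention head, the averaging over query rows, the toy numbers, and the function names attention_contribution and top_k_tokens are all assumptions made for illustration, not the authors' implementation.

```python
import math

def attention_contribution(attn, values):
    """Score each key token j by mean_i attn[i][j] * ||values[j]||_2.

    attn:   list of query rows, each a list of attention probabilities over keys
    values: list of value vectors, one per key token
    (Single head and mean over queries are illustrative assumptions.)
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    n_queries = len(attn)
    scores = []
    for j, norm in enumerate(norms):
        mean_prob = sum(row[j] for row in attn) / n_queries
        scores.append(mean_prob * norm)
    return scores

def top_k_tokens(scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy data: token 0 receives most of the attention but has a tiny value
# vector, so it contributes little; token 1 receives the least attention
# but has a much larger value vector than token 0.
attn = [
    [0.70, 0.10, 0.20],
    [0.60, 0.20, 0.20],
]
values = [
    [0.01, 0.00],
    [1.00, 0.00],
    [3.00, 4.00],
]

# Ranking by raw attention probability alone.
mean_attn = [sum(row[j] for row in attn) / len(attn) for j in range(3)]
by_attention = top_k_tokens(mean_attn, k=2)

# Ranking by attention probability weighted by value vector norm.
contribution = attention_contribution(attn, values)
by_contribution = top_k_tokens(contribution, k=2)

print("kept by attention score:", by_attention)
print("kept by attention contribution:", by_contribution)
```

In this toy case the two criteria keep different tokens: raw attention retains token 0, while the contribution score drops it in favor of token 1, which is the kind of disagreement the abstract attributes to low-contribution attention sinks.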

Related papers