Fugu-MT Paper Translation (Abstract): VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Paper overview: VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

arxiv url: http://arxiv.org/abs/2603.23495v1
Date: Tue, 24 Mar 2026 17:58:17 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.630595
Title: VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
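Title (reference translation): VISion On Request: Improving VLLM efficiency through sparse, dynamically selected vision-language interactions
Authors: Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos
Abstract summary: The paper introduces VISOR (VISion On Request), which reduces inference cost without discarding visual information. VISOR improves efficiency by sparsifying the interaction between image and text tokens. Experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results.
Reference score (independently computed notability): 51.41587958253802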
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
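Abstract (reference translation): Existing approaches to improving the efficiency of Large Vision-Language Models (LVLMs) are based mainly on the idea of reducing visual tokens. This approach, however, creates an information bottleneck that hurts performance, particularly on difficult tasks that require fine-grained understanding and reasoning. This work challenges that paradigm by introducing VISion On Request (VISOR), which reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is supplied by efficient cross-attention between text and image, while a few well-placed, dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, the authors first train a single universal network across a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation according to per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and that it excels on challenging tasks requiring detailed visual understanding.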
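The abstract's central idea, keeping every visual token but letting the language model touch them in only a few layers, can be illustrated with a toy cost model. The sketch below is not the authors' implementation: it simply counts query-key pairs per attention layer, ignores causal masking and MLP cost, and assumes three hypothetical layer kinds (text-only self-attention, text-to-image cross-attention added on top of a text-only layer, and full self-attention over image and text tokens). The names `Budget`, `dense_pairs`, `sparse_pairs`, and `pick_self_layers`, as well as the linear policy rule, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    num_layers: int  # decoder layers in the language model
    n_vis: int       # visual tokens, kept at full resolution
    n_txt: int       # text tokens
    n_cross: int     # layers that add text-to-image cross-attention
    n_self: int      # layers that run full self-attention over image + text


def dense_pairs(b: Budget) -> int:
    """Baseline: every layer runs self-attention over all tokens."""
    n = b.n_vis + b.n_txt
    return b.num_layers * n * n


def sparse_pairs(b: Budget) -> int:
    """Toy model: text-only attention in most layers, vision in a few."""
    # Layers not selected for full self-attention only mix text tokens.
    text_only = (b.num_layers - b.n_self) * b.n_txt * b.n_txt
    # Cross-attention: text queries read all visual keys.
    cross = b.n_cross * b.n_txt * b.n_vis
    # Selected self-attention layers also refine the visual tokens.
    full = b.n_self * (b.n_vis + b.n_txt) ** 2
    return text_only + cross + full


def pick_self_layers(complexity: float, max_self: int) -> int:
    """Hypothetical policy: map a complexity score in [0, 1] to a layer count."""
    complexity = min(max(complexity, 0.0), 1.0)
    return round(complexity * max_self)


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    n_self = pick_self_layers(complexity=0.5, max_self=4)
    b = Budget(num_layers=32, n_vis=2304, n_txt=64, n_cross=8, n_self=n_self)
    print(f"sparse / dense attention pairs: {sparse_pairs(b) / dense_pairs(b):.3f}")
```

Under these assumptions the cost is dominated by the few full self-attention layers, which is why varying their number per sample is a natural knob for a compute budget.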

Related papers