Fugu-MT Paper Translation (Abstract): To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

Paper overview: To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2510.08510v1
Date: Thu, 09 Oct 2025 17:44:42 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.26262
Title: To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Authors: Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal
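Abstract summary: The Vision Transformer (ViT) encodes visual content into a sequence of image tokens. The Large Language Model (LLM) interprets these tokens to perform high-level reasoning. We identify a class of high-norm visual tokens from ViT, referred to as ViT attention sinks.
Reference score (independently computed attention measure): 34.902254997726835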
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
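The abstract describes ViT attention sinks as a class of high-norm visual tokens but does not give a selection rule. The following is a minimal sketch of how such tokens might be flagged, assuming each token is a plain list of floats. The function names, the median-ratio rule, and the default `ratio` of 3.0 are illustrative assumptions, not the authors' method.

```python
import math
import statistics
from typing import List, Sequence


def token_norms(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Return the L2 norm of each token embedding."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_high_norm_tokens(
    tokens: Sequence[Sequence[float]], ratio: float = 3.0
) -> List[int]:
    """Return indices of tokens whose norm exceeds `ratio` times the median norm.

    Assumption: the median-ratio threshold is a hypothetical heuristic chosen
    for illustration; the abstract does not specify how sinks are selected.
    """
    norms = token_norms(tokens)
    if not norms:
        return []
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]


if __name__ == "__main__":
    # Toy embeddings: three ordinary tokens and one with a much larger norm.
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [30.0, 40.0], [1.0, 1.0]]
    print(find_high_norm_tokens(toy_tokens))
```

In a real LVLM the inputs would be the ViT output embeddings for one image, and the threshold would need to be tuned per encoder.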
