Fugu-MT Paper Translation (Abstract): VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Paper overview: VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

arxiv url: http://arxiv.org/abs/2604.02486v1
Date: Thu, 02 Apr 2026 19:40:56 GMT
Status: Translation complete
System update date: 2026-04-06 17:20:24.190352
Title: VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Title (reference translation): VLMs Need Words: Vision Language Models
Authors: Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong
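Abstract summary: Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, they often fail even when the required information is present in their internal representations. This gap arises from a narrow training pipeline that focuses on moving visual information into the textual space.
Reference score (independently computed attention level): 14.288057170664983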
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline which focuses on moving visual information to the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities because they rely on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens compared to unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.
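Abstract (reference translation): Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on tasks that require fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from a narrow training pipeline that focuses on moving visual information into the textual space. As a consequence, VLMs can reason only about visual entities that can be mapped to known concepts in the language space, so vision-focused tasks such as visual correspondence and reasoning about novel visual entities are poorly supported. VLMs are therefore severely limited in several important multimodal capabilities, because they rely on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens than they do for unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, while task-specific finetuning yields even stronger generalization without relying on language priors. These results suggest that current VLM failures on visual tasks reflect shortcuts learned during training rather than a fundamental limitation of multimodal architectures.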
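
Note (illustrative sketch): The abstract refers to Logit Lens analyses that compare how many unique tokens surface for nameable versus unnameable entities. The general Logit Lens idea is to project an intermediate hidden state through the model's unembedding matrix and read off the most likely vocabulary token. The toy Python below illustrates only that idea. The vocabulary, matrix, hidden states, and function names are all hypothetical, the final layer normalization that a real model usually applies before unembedding is omitted, and this is not the authors' implementation.

```python
from typing import List, Sequence


def logit_lens_top_tokens(
    hidden_states: Sequence[Sequence[float]],
    unembedding: Sequence[Sequence[float]],
    vocab: Sequence[str],
) -> List[str]:
    """Return the top-1 vocabulary token for each hidden state.

    Each row of `unembedding` is the output embedding of one vocab entry,
    so a dot product with the hidden state gives that token's logit.
    """
    tokens = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        best = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(vocab[best])
    return tokens


def unique_token_count(tokens: Sequence[str]) -> int:
    """Count distinct tokens surfaced across positions."""
    return len(set(tokens))


# Hypothetical toy data: 3 vocab entries, 2-dimensional hidden states.
vocab = ["cat", "dog", "the"]
unembedding = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.8]]

# Assumed hidden states at image-patch positions for two kinds of entity.
nameable_states = [[2.0, 0.1], [0.1, 2.0], [1.9, 0.0]]
unnameable_states = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]

nameable_tokens = logit_lens_top_tokens(nameable_states, unembedding, vocab)
unnameable_tokens = logit_lens_top_tokens(unnameable_states, unembedding, vocab)

print(nameable_tokens, unique_token_count(nameable_tokens))
print(unnameable_tokens, unique_token_count(unnameable_tokens))
```

In this toy setup the nameable positions decode to distinct content words, while the unnameable positions all collapse to the same generic token. That mirrors the qualitative pattern the abstract describes, but the numbers here are constructed for illustration and say nothing about the paper's actual measurements.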

