Fugu-MT Paper Translation (Abstract): Visual CoT Makes VLMs Smarter but More Fragile

Paper overview: Visual CoT Makes VLMs Smarter but More Fragile

arxiv url: http://arxiv.org/abs/2509.23789v1
Date: Sun, 28 Sep 2025 10:19:59 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.448892
Title: Visual CoT Makes VLMs Smarter but More Fragile
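Title (reference translation): Visual CoT Makes VLMs Smarter, but More Delicate
Authors: Chunxue Xu, Yiwei Wang, Yujun Cai, Bryan Hooi, Songze Li
Abstract summary: Chain-of-Thought (CoT) techniques have markedly improved reasoning in vision-language models (VLMs). Visual CoT integrates explicit visual edits, such as cropping or annotating regions of interest, into the reasoning process. The paper presents the first systematic evaluation of Visual CoT robustness under visual perturbations.
Reference score (independently computed attention score): 79.32638667101817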
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chain-of-Thought (CoT) techniques have significantly enhanced reasoning in Vision-Language Models (VLMs). Extending this paradigm, Visual CoT integrates explicit visual edits, such as cropping or annotating regions of interest, into the reasoning process, achieving superior multimodal performance. However, the robustness of Visual CoT-based VLMs against image-level noise remains unexplored. In this paper, we present the first systematic evaluation of Visual CoT robustness under visual perturbations. Our benchmark spans 12 image corruption types across 4 Visual Question Answering (VQA) datasets, enabling a comprehensive comparison between VLMs that use Visual CoT and VLMs that do not. The results reveal that integrating Visual CoT consistently improves absolute accuracy regardless of whether the input images are clean or corrupted by noise; however, it also increases sensitivity to input perturbations, resulting in sharper performance degradation compared to standard VLMs. Through extensive analysis, we identify the intermediate reasoning components of Visual CoT, i.e., the edited image patches, as the primary source of fragility. Building on this analysis, we propose a plug-and-play robustness enhancement method that integrates the Grounding DINO model into the Visual CoT pipeline, providing high-confidence local visual cues to stabilize reasoning. Our work reveals clear fragility patterns in Visual CoT and offers an effective, architecture-agnostic solution for enhancing visual robustness.
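Abstract (reference translation): Chain-of-Thought (CoT) techniques have greatly enhanced reasoning in Vision-Language Models (VLMs). Extending this paradigm, Visual CoT integrates explicit visual edits, such as cropping and annotation, into the reasoning process and achieves strong multimodal performance. However, the robustness of Visual CoT-based VLMs to image-level noise remains unexplored. This paper presents the first systematic evaluation of Visual CoT robustness under visual perturbations. Our benchmark covers 12 image corruption types across 4 Visual Question Answering (VQA) datasets, enabling a comprehensive comparison between VLMs that use Visual CoT and those that do not. The results show that integrating Visual CoT improves absolute accuracy whether or not the input images are degraded by noise, but it also increases sensitivity to input perturbations, so performance degrades more sharply than in standard VLMs. Through extensive analysis, we identify the intermediate reasoning components of Visual CoT, namely the edited image patches, as the primary source of fragility. We therefore integrate the Grounding DINO model into the Visual CoT pipeline, supplying high-confidence local visual cues to stabilize reasoning. Our work reveals clear fragility patterns in Visual CoT and offers an effective, architecture-agnostic solution for improving visual robustness.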
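
The abstract's central claim has two parts: Visual CoT is more accurate on both clean and corrupted inputs, yet its accuracy falls more steeply under corruption. The sketch below shows one way these two quantities could be separated. The function name `robustness_summary`, the relative-drop definition, and all accuracy values are illustrative assumptions. They are not the paper's metric or its results.

```python
from statistics import mean


def robustness_summary(clean_acc, corrupted_accs):
    """Summarize robustness for one model.

    Returns a tuple of:
      - mean accuracy over the corrupted conditions,
      - absolute drop (clean accuracy minus mean corrupted accuracy),
      - relative drop (absolute drop as a fraction of clean accuracy).
    This particular definition is an assumption made for illustration.
    """
    corrupted_mean = mean(corrupted_accs)
    absolute_drop = clean_acc - corrupted_mean
    relative_drop = absolute_drop / clean_acc
    return corrupted_mean, absolute_drop, relative_drop


# Hypothetical accuracies (NOT taken from the paper): one clean score and
# three corrupted scores per model.
visual_cot = robustness_summary(0.80, [0.70, 0.66, 0.62])
baseline = robustness_summary(0.60, [0.57, 0.55, 0.53])

# "Smarter": higher accuracy even when inputs are corrupted.
print("Visual CoT more accurate under corruption:", visual_cot[0] > baseline[0])
# "More fragile": a larger share of the clean accuracy is lost.
print("Visual CoT degrades more sharply:", visual_cot[2] > baseline[2])
```

With these made-up numbers both comparisons hold at once, which is the pattern the abstract reports: absolute accuracy and sensitivity to perturbation are separate quantities and can move in opposite directions.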

Related papers