Fugu-MT Paper Translation (Abstract): Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

Paper abstract: Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

arxiv url: http://arxiv.org/abs/2604.14888v1
Date: Thu, 16 Apr 2026 11:28:53 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.869175
Title: Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Title (reference translation): Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Authors: Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott
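Abstract summary: An analysis of reasoning dynamics in vision-language models (VLMs) finds that models are prone to answer inertia, in which early commitments to a prediction are reinforced. Reasoning-trained models are more likely to refer explicitly to misleading textual cues, but their longer CoTs can still appear visually grounded.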
Reference score (independently computed attention score): 34.388508959416725
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
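Abstract (reference translation): Recent advances in vision-language models (VLMs) provide reasoning capabilities, but how that reasoning unfolds and how it integrates visual and textual information remains unclear. We analyze the reasoning dynamics of 18 VLMs, covering instruction-tuned and reasoning-trained models from two model families. We track confidence over the Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia: early commitments to a prediction are reinforced rather than revised during the reasoning steps. Reasoning-trained models show stronger corrective behavior, but their gains depend on the modality condition, which ranges from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when the visual evidence is sufficient, and we assess whether this influence can be recovered from the CoT. The influence can appear in the CoT, but its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to refer to the cues explicitly, yet their longer, fluent CoTs can still appear visually grounded while actually following the textual cues, which obscures modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.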

Related papers list