Fugu-MT Paper Translation (Abstract): On the Faithfulness of Visual Thinking: Measurement and Enhancement

Paper overview: On the Faithfulness of Visual Thinking: Measurement and Enhancement

arxiv url: http://arxiv.org/abs/2510.23482v1
Date: Mon, 27 Oct 2025 16:15:54 GMT
Status: Translation complete
System update date: 2025-10-28 15:28:15.614403
Title: On the Faithfulness of Visual Thinking: Measurement and Enhancement
Title (reference translation): On the Faithfulness of Visual Thinking: Measurement and Enhancement
Authors: Zujing Liu, Junwen Pan, Qi She, Yuan Gao, Guisong Xia
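Abstract summary: Recent vision-language models can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning. The visual information incorporated in MCoT is often inaccurate, yet the traces still yield correct answers. This paper proposes a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning.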
Reference score (independently computed attention score): 37.52991654147004
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate, even though the traces still yield correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of the visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened on. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To further analyze visual information, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. We note that the proposed SCCM is annotation-free and compatible with various RFT methods for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.
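Abstract (reference translation): Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, the visual information incorporated in MCoT is often inaccurate even though correct answers are still obtained, which suggests a lack of faithfulness in the MCoT reasoning process. This unfaithfulness is attributed to the RL reward in RFT, which encourages the model to incorporate visual information into its text reasoning steps without considering whether that visual information is correct. The paper first probes the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened on. Surprisingly, the model's predictions barely change under visual intervention but change substantially under textual intervention, showing that the visual evidence is largely ignored. To analyze the visual information further, the authors introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. The evaluation finds that the visual information in current MCoT traces is both unreliable and insufficient. To address this, the paper proposes a novel MCoT learning strategy called SCCM learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that can independently lead to correct answers. The proposed SCCM is annotation-free and plug-and-play compatible with various RFT methods for MCoT. Empirical results show that SCCM consistently improves visual faithfulness across fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.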
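
The intervention probe described in the abstract (measuring how much the prediction changes when visual or textual thoughts are intervened on) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trace` structure, the string-based corruption functions, the toy predictor, and the "fraction of changed predictions" score are all assumptions made for illustration, and the paper may define its interventions and metric differently.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trace:
    """Hypothetical container for one MCoT trace (not from the paper)."""
    question: str
    visual_thought: str   # text stand-in for a visual cue, e.g. a cropped region
    textual_thought: str


Predictor = Callable[[Trace], str]
Intervention = Callable[[Trace], Trace]


def prediction_change_rate(
    predict: Predictor, traces: Sequence[Trace], intervene: Intervention
) -> float:
    """Fraction of traces whose predicted answer changes after the intervention."""
    if not traces:
        raise ValueError("traces must be non-empty")
    changed = sum(predict(t) != predict(intervene(t)) for t in traces)
    return changed / len(traces)


def corrupt_visual(t: Trace) -> Trace:
    # Assumed intervention: replace the visual thought with a placeholder.
    return replace(t, visual_thought="<corrupted>")


def corrupt_textual(t: Trace) -> Trace:
    # Assumed intervention: replace the textual thought with a placeholder.
    return replace(t, textual_thought="<corrupted>")


def toy_predict(t: Trace) -> str:
    # Toy model that ignores the visual thought entirely and reads the
    # answer from the last word of the textual thought.
    return t.textual_thought.split()[-1]


traces = [
    Trace("What animal is shown?", "crop: upper left", "so the answer is cat"),
    Trace("What color is the car?", "crop: center", "so the answer is red"),
]

visual_rate = prediction_change_rate(toy_predict, traces, corrupt_visual)
textual_rate = prediction_change_rate(toy_predict, traces, corrupt_textual)
print(f"visual: {visual_rate:.2f}, textual: {textual_rate:.2f}")
```

A predictor that ignores its visual thoughts, like the toy one here, shows a low change rate under visual intervention and a high one under textual intervention, which is the pattern the abstract reports as evidence that the visual evidence is largely ignored.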

Related papers