Fugu-MT Paper Translation (Summary): AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

Paper summary: AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

arxiv url: http://arxiv.org/abs/2509.25699v1
Date: Tue, 30 Sep 2025 02:57:44 GMT
Status: Translation complete
System update date: 2025-10-01 14:44:59.989995
Title: AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
Title (reference translation): AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
Authors: Xiping Li, Jianghong Ma
Abstract summary: Multimodal Chain-of-Thought (CoT) is an effective technique for enhancing reasoning with interleaved information. We propose AIMCoT, an Active Information-driven Multi-modal Chain-of-Thought framework that addresses fundamental limitations.
Reference score (independently computed attention score): 12.026066807427945
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Chain-of-Thought (CoT) has emerged as a powerful technique for enhancing vision-language reasoning with interleaved information. However, existing methods often rely on simplistic heuristics for constructing interleaved CoT, typically depending on attention maps, which our empirical analysis reveals can be unreliable. Moreover, the shortcomings of their passive, purposeless selection strategies and arbitrary triggering mechanisms in capturing the model's cognitive need for information are further amplified. In this paper, we propose AIMCoT, an Active Information-driven Multi-modal Chain-of-Thought framework that addresses these fundamental limitations. AIMCoT introduces three synergistic components: (1) Context-enhanced Attention-map Generation (CAG), which mitigates the text-vision granularity imbalance, thereby producing more reliable attention maps as a foundation; (2) Active Visual Probing (AVP), which replaces passive selection with a proactive, goal-oriented strategy grounded in information theory to select the image regions that help most in answering the question; and (3) Dynamic Attention-shifting Trigger (DAT), which intelligently determines the optimal moments to insert visual information by monitoring the model's text-to-vision attention shifts. Extensive experiments on three challenging benchmarks demonstrate that AIMCoT significantly outperforms state-of-the-art methods across different settings. By actively foraging for information and dynamically structuring its reasoning process, AIMCoT represents a critical step towards more robust, effective, and human-like multimodal reasoning. Our code is available at https://anonymous.4open.science/r/AIMCoT.
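Abstract (reference translation): Multimodal Chain-of-Thought (CoT) has emerged as a powerful technique for enhancing vision-language reasoning with interleaved information. However, existing methods usually depend on attention maps and rely on simplistic heuristics to construct interleaved CoT. In addition, the shortcomings of their passive, purposeless selection strategies and of their arbitrary triggering mechanisms in capturing the model's cognitive need for information are further amplified. In this paper, we propose AIMCoT, an Active Information-driven Multi-modal Chain-of-Thought framework that addresses these fundamental limitations. AIMCoT introduces three synergistic components: (1) Context-enhanced Attention-map Generation (CAG); (2) Active Visual Probing (AVP), which replaces passive selection with a proactive, goal-oriented strategy based on information theory and selects the image regions that help most in answering the question; and (3) Dynamic Attention-shifting Trigger (DAT), which intelligently determines the optimal moments to insert visual information by monitoring the model's text-to-vision attention shifts. Extensive experiments on three challenging benchmarks show that AIMCoT significantly outperforms state-of-the-art methods across a variety of settings. By actively gathering information and dynamically structuring its reasoning process, AIMCoT is an important step towards more robust, effective, and human-like multimodal reasoning. Our code is available at https://anonymous.4open.science/r/AIMCoT.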
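
The abstract does not say how AVP scores image regions or how DAT detects an attention shift, so the following is only a rough illustration of the two ideas, not the authors' method. It assumes, hypothetically, that each candidate region comes with a predicted answer distribution, that regions are ranked by entropy reduction (one common information-theoretic criterion), and that the trigger fires when the share of attention on visual tokens rises by more than a fixed threshold. All function names, parameters, and the threshold value are made up for this sketch.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_regions(prior, posteriors, k):
    """Pick up to k regions that most reduce uncertainty about the answer.

    prior:      answer distribution before looking at any region (assumed input)
    posteriors: one answer distribution per candidate region (assumed input)
    Returns region indices sorted by decreasing entropy reduction,
    keeping only regions with strictly positive gain.
    """
    base = entropy(prior)
    gains = [(base - entropy(post), idx) for idx, post in enumerate(posteriors)]
    gains.sort(key=lambda g: (-g[0], g[1]))  # largest gain first, ties by index
    return [idx for gain, idx in gains[:k] if gain > 0]


def should_insert_visual(prev_visual_share, curr_visual_share, threshold=0.2):
    """Hypothetical trigger: fire when the fraction of attention mass on
    visual tokens increases by more than `threshold` between two steps."""
    return (curr_visual_share - prev_visual_share) > threshold


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]
    posteriors = [
        [0.25, 0.25, 0.25, 0.25],  # uninformative region
        [0.70, 0.10, 0.10, 0.10],  # somewhat informative region
        [0.97, 0.01, 0.01, 0.01],  # highly informative region
    ]
    print(select_regions(prior, posteriors, k=2))
    print(should_insert_visual(0.10, 0.40))
```

In this toy setup the uninformative region is never selected, because its answer distribution equals the prior and therefore yields zero gain. A real system would have to obtain the per-region distributions and attention shares from the model itself, which the abstract does not describe.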


Related papers