Fugu-MT Paper Translation (Abstract): TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

Paper overview: TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

arxiv url: http://arxiv.org/abs/2606.08464v1
Date: Sun, 07 Jun 2026 05:58:39 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.120366
Title: TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding
Title (reference translation): TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding
Authors: Lianyu Hu, Xiaoyu Ma, Zeqin Liao, Yang Liu
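Abstract summary: Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. Existing CoT approaches suffer from a fundamental limitation: they reason entirely in text. The authors propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access.
Reference score (independently computed attention score): 10.402346011516423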
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description. This forms a 'vision-blind reasoning' paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through the learnable control tokens <THINK>, <LOOK>, and <ANSWER>. These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and a notable performance boost over the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.
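Abstract (reference translation): Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they reason entirely in text, without accessing visual features during the reasoning process. After the initial visual encoding, image information becomes inaccessible, forcing the model to reason only from what the initial description captured. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through the learnable control tokens <THINK>, <LOOK>, and <ANSWER>. These tokens allow dynamic switching between reasoning and visual grounding, attending to the relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks show state-of-the-art results among MLLM-based CoT methods and notable gains over the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.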
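The control-token switching described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see their repository for that). It only shows the control flow under simplifying assumptions: the token stream is already generated, image regions are plain text labels rather than visual features, and the names `Trace`, `attend`, and `run_interleaved` are hypothetical. At each <LOOK>, the most recent reasoning text is used as the query, loosely mirroring "conditioned on the evolving reasoning state".

```python
from dataclasses import dataclass, field

# Control tokens named in the abstract.
THINK, LOOK, ANSWER = "<THINK>", "<LOOK>", "<ANSWER>"


@dataclass
class Trace:
    """Record of one interleaved reasoning episode (hypothetical helper)."""
    steps: list = field(default_factory=list)  # (mode, text) pairs
    answer: str = ""


def attend(image_regions, query):
    """Toy stand-in for visual grounding: pick the region whose label
    shares the most words with the query. A real model would attend
    over visual features instead of comparing strings."""
    query_words = set(query.lower().split())
    return max(
        image_regions,
        key=lambda region: len(query_words & set(region["label"].lower().split())),
    )


def run_interleaved(stream, image_regions):
    """Walk a pre-generated token stream and switch mode at control tokens."""
    trace = Trace()
    mode = None
    last_thought = ""
    for item in stream:
        if item == LOOK:
            # Re-read the image, conditioned on the latest reasoning text.
            region = attend(image_regions, last_thought)
            trace.steps.append(("look", region["label"]))
            mode = None
        elif item in (THINK, ANSWER):
            mode = item
        elif mode == THINK:
            last_thought = item
            trace.steps.append(("think", item))
        elif mode == ANSWER:
            trace.answer = item
    return trace


# Hypothetical example data, invented for illustration only.
regions = [{"label": "red blocks on table"}, {"label": "blue cup"}]
stream = [THINK, "count the red blocks", LOOK,
          THINK, "I see three red blocks", ANSWER, "3"]
trace = run_interleaved(stream, regions)
print(trace.steps)
print(trace.answer)
```

The point of the sketch is the ordering: a visual lookup happens in the middle of the reasoning chain rather than only once before it, which is the contrast the abstract draws with text-only CoT.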


Related papers