Fugu-MT Paper Translation (Abstract): Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Paper overview: Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

arxiv url: http://arxiv.org/abs/2606.22565v1
Date: Sun, 21 Jun 2026 15:59:41 GMT
Status: Translation complete
System update date: 2026-06-25 17:39:03.607033
Title: Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do
Title (reference translation): What Multimodal Chain-of-Thought Reasoning Can and Cannot Do
Authors: Zhuoran Jin, Kejian Zhu, Hongbang Yuan, Yupu Hao, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
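Abstract summary: Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models. This paper systematically investigates what multimodal CoT can do and why it falls short.
Reference score (independently computed attention score): 37.70222730556387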
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks remains unclear. In this paper, we aim to systematically investigate the key question: What can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 12 multimodal tasks across perception and reasoning categories using both 14 non-reasoning models and 8 reasoning models. Our analysis reveals several important findings: (1) CoT is not a free lunch and should be used selectively depending on the specific requirements of each task. For perception tasks, CoT can lead to undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective for reasoning tasks involving mathematical, scientific, and multi-image reasoning; (2) Compared to original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly due to an overemphasis on mathematical reasoning at the expense of broader capabilities; (3) Visual reasoning remains a key bottleneck for current multimodal CoT, as models exhibit a Look Light, Think Heavy pattern where verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.
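Abstract (reference translation): Chain-of-Thought (CoT) has become a standard method for improving the reasoning capabilities of large language models (LLMs) through step-by-step thinking, but its effectiveness on multimodal tasks remains unclear. This paper aims to systematically examine what multimodal Chain-of-Thought reasoning can do, and where and why it falls short. To that end, the study evaluates 12 multimodal tasks spanning perception and reasoning categories, using 14 non-reasoning models and 8 reasoning models. The analysis shows that: (1) CoT is not a free lunch and should be used selectively according to the requirements of each task. On perception tasks, CoT can cause undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective on reasoning tasks involving mathematical, scientific, and multi-image reasoning; (2) compared with the original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly because they overemphasize mathematical reasoning at the expense of broader capabilities; (3) visual reasoning remains a key bottleneck for current multimodal CoT, as models exhibit a Look Light, Think Heavy pattern in which verbal reflection rises and falls during reasoning while visual reflection consistently diminishes. These results suggest that multimodal CoT handles verbal reflection relatively well but lacks the ability to maintain deep visual introspection throughout the reasoning process.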

Related papers