Fugu-MT Paper Translation (Summary): From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

Paper summary: From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2511.12861v2
Date: Tue, 18 Nov 2025 05:45:04 GMT
Status: Translation complete
Last updated in system: 2025-11-19 13:59:16.793214
Title: From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
Title (reference translation): From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
Authors: Wenxin Zhu, Andong Chen, Yuchen Song, Kehai Chen, Conghui Zhu, Ziyan Chen, Tiejun Zhao
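Abstract summary: Chain-of-Thought (CoT) reasoning has shown significant efficacy in language models by enhancing reasoning transparency and output interpretability. This paper provides a systematic review centered on Multimodal Chain-of-Thought (MCoT).
Reference score (independently computed attention score): 36.54062692717823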
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
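Abstract (reference translation): With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, strengthening their complex reasoning capabilities has become an important research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning has shown notable efficacy in language models by enhancing reasoning transparency and output interpretability, and promises to improve model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on Multimodal Chain-of-Thought (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Next, it introduces mainstream MCoT methods from three aspects (CoT paradigms, the post-training stage, and the inference stage) and analyzes their underlying mechanisms. In addition, it summarizes existing evaluation benchmarks and metrics and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and gives an outlook on future research directions.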

Related papers