Fugu-MT Paper Translation (Abstract): From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Paper overview: From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2509.25373v1
Date: Mon, 29 Sep 2025 18:25:40 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.265251
Title: From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Title (reference translation): From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Authors: Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, Fei Luo, Xiaohua Chen, Xiaoshuai Hao, Hehan Li, Andi Zhang, Wenxuan Wang, Lingling Li, Zhiwu Lu, Yang Lu, Yike Guo
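Abstract summary: Multimodal Large Language Models (MLLMs) strive for a deep, human-like understanding of and interaction with the physical world. However, they often exhibit shallow and incoherent integration when acquiring information (perception) and conducting reasoning (cognition). This survey introduces a novel, unified analytical framework, "From Perception to Cognition."
Reference score (independently calculated attention score): 59.85951092642609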
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.
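Abstract (reference translation): Multimodal Large Language Models (MLLMs) strive to achieve a deep, human-like understanding of and interaction with the physical world, but they often show shallow and incoherent integration when acquiring information (perception) and conducting reasoning (cognition). This disconnect leads to reasoning failures, of which hallucination is the most prominent. The ability to process pixels does not yet provide the ability to build a coherent, reliable internal world model. To systematically dissect and address this challenge, this survey introduces a new, unified analytical framework, "From Perception to Cognition." We decompose the complex process of vision-language interactive understanding into two interdependent layers: perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and cognition, the higher-order ability for proactive, multi-step, goal-oriented reasoning built on this perceptual foundation, whose core is the formation of a dynamic observe-think-verify reasoning loop. This paper systematically analyzes the bottlenecks of current MLLMs at both layers. It surveys state-of-the-art methods designed to address these challenges, ranging from techniques that improve low-level visual representations to those that improve high-level reasoning paradigms. It also reviews important benchmarks and outlines future research directions. This survey aims to give the research community a clear, structured perspective for understanding the intrinsic limitations of current MLLMs, and to illuminate the path toward next-generation models capable of deep reasoning and genuine understanding of the world.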
