How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
- URL: http://arxiv.org/abs/2503.10042v1
- Date: Thu, 13 Mar 2025 04:48:43 GMT
- Title: How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
- Authors: Ziyue Wang, Yurui Dong, Fuwen Luo, Minyuan Ruan, Zhili Cheng, Chi Chen, Peng Li, Yang Liu,
- Abstract summary: We introduce MM-Escape, a benchmark for investigating multimodal reasoning.<n> MM-Escape emphasizes intermediate model behaviors alongside final task completion.<n>Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks.<n>We observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities.
- Score: 11.721839449847472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancing of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in the real-world and virtual environment, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess the final task completion, often degrading assessments to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond merely task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance dramatically drops as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props, such as the key. We hope our work sheds light on new challenges in multimodal reasoning, and uncovers potential improvements in MLLMs capabilities.
Related papers
- V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
V-MAGE is a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs.
We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning.
arXiv Detail & Related papers (2025-04-08T15:43:01Z) - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs)<n>We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT)<n>It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z) - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.<n>EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.<n>Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z) - Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for large language models (MLLMs)<n>We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs.<n>We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z) - Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models [27.78471707423076]
We propose a new visual reasoning paradigm enabling MLLMs to autonomously modify the input scene to new ones based on its reasoning status.<n>We introduce a novel plug-and-play imagination space, where MLLMs conduct visual modifications through operations like focus, ignore, and transform.<n>We validate our approach through a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement.
arXiv Detail & Related papers (2024-11-27T08:44:25Z) - BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games [44.16513620589459]
We introduce BALROG, a novel benchmark to assess the agentic capabilities of Large Language Models (LLMs) and Vision Language Models (VLMs)
Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, including tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master.
Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks.
arXiv Detail & Related papers (2024-11-20T18:54:32Z) - ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models [18.992215985625492]
We evaluate active perception in Multimodal Large Language Models (MLLMs)
We focus on a specialized form of Visual Question Answering (VQA) that eases the evaluation yet challenging for existing MLLMs.
We observe that the ability to read and comprehend multiple images simultaneously plays a significant role in enabling active perception.
arXiv Detail & Related papers (2024-10-07T00:16:26Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept -Missing Modality Token (MMT)-to maintain performance even when modalities are absent.
Our method mitigates the performance loss, reducing it from its original $sim 30%$ drop to only $sim 10%$ when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.<n>Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.<n>We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.