VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
- URL: http://arxiv.org/abs/2503.14427v2
- Date: Sat, 22 Mar 2025 05:06:18 GMT
- Title: VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
- Authors: Seungwon Lim, Sungwoong Kim, Jihwan Yu, Sungjae Lee, Jiwan Chung, Youngjae Yu
- Abstract summary: We introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under challenging conditions. On VisEscape, we observe that even state-of-the-art multimodal models generally fail to escape the rooms, showing considerable variation in their levels of progress and trajectories. We propose VisEscaper, which effectively integrates Memory, Feedback, and ReAct modules, demonstrating significant improvements by performing 3.7 times more effectively and 4.9 times more efficiently on average compared to baseline agents.
- Score: 19.642395585971194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Escape rooms present a unique cognitive challenge that demands exploration-driven planning: players should actively search their environment, continuously update their knowledge based on new discoveries, and connect disparate clues to determine which elements are relevant to their objectives. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observe that even state-of-the-art multimodal models generally fail to escape the rooms, showing considerable variation in their levels of progress and trajectories. To address this issue, we propose VisEscaper, which effectively integrates Memory, Feedback, and ReAct modules, demonstrating significant improvements by performing 3.7 times more effectively and 4.9 times more efficiently on average compared to baseline agents.
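The abstract names three components (Memory, Feedback, and ReAct) without implementation detail. As a rough sketch of how such an agent loop could fit together, the following is purely illustrative: the class names, prompt wording, and the `env`/`llm` interfaces are assumptions, not the paper's actual code.

```python
# Hypothetical sketch of an agent loop wiring together the Memory,
# Feedback, and ReAct components named in the abstract.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Accumulates discoveries so clues found earlier stay available."""
    entries: list = field(default_factory=list)

    def add(self, observation: str, thought: str) -> None:
        self.entries.append((observation, thought))

    def summary(self, last_n: int = 20) -> str:
        return "\n".join(obs for obs, _ in self.entries[-last_n:])


def react_step(llm, observation: str, memory: Memory, feedback: str) -> str:
    """One ReAct iteration: reason over the state and memory, emit an action."""
    prompt = (
        f"Discovered so far:\n{memory.summary()}\n\n"
        f"Feedback on previous action: {feedback}\n"
        f"Current observation: {observation}\n\n"
        "Thought: reason step by step about what to try next.\n"
        "Then answer with a single line 'Action: <action>'."
    )
    response = llm(prompt)  # assumed: callable returning the model's text
    memory.add(observation, response)
    return response.rsplit("Action:", 1)[-1].strip()


def run_episode(env, llm, max_steps: int = 100) -> bool:
    """Assumed env API: reset() -> obs, step(a) -> (obs, feedback, escaped)."""
    memory = Memory()
    observation, feedback = env.reset(), "none yet"
    for _ in range(max_steps):
        action = react_step(llm, observation, memory, feedback)
        observation, feedback, escaped = env.step(action)
        if escaped:
            return True
    return False
```

The key design point suggested by the abstract is that memory and environment feedback are fed back into every ReAct step, so the agent can connect clues discovered many steps apart.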
Related papers
- Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering [87.76784654371312]
Embodied Question Answering requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi-step reasoning to answer questions. Existing datasets often introduce biases or prior knowledge, leading to disembodied reasoning. We construct the largest dataset designed specifically to evaluate both exploration and reasoning capabilities.
arXiv Detail & Related papers (2025-03-14T06:29:47Z)
- How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game [11.721839449847472]
We introduce MM-Escape, a benchmark for investigating multimodal reasoning. MM-Escape emphasizes intermediate model behaviors alongside final task completion. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks. We observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities.
arXiv Detail & Related papers (2025-03-13T04:48:43Z)
- AirRoom: Objects Matter in Room Reidentification [4.386378218714507]
AirRoom is an object-aware pipeline that integrates multi-level object-oriented information. AirRoom outperforms state-of-the-art (SOTA) models across nearly all evaluation metrics.
arXiv Detail & Related papers (2025-03-03T03:20:08Z)
- EscapeBench: Pushing Language Models to Think Outside the Box [49.44742596224033]
We introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning. Our results show that current LMs, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints. We propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks).
arXiv Detail & Related papers (2024-12-18T06:50:39Z)
- Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models [61.899791071654654]
We introduce a benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning.
We investigate the performance of state-of-the-art vision-language models (VLMs) on this task.
We develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues.
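The abstract describes SpatialPrompt only at this level; one plausible rendering of a reference-object prompt, where the template wording is an illustration rather than the paper's actual prompt, is:

```python
# Illustrative zero-shot prompt in the spirit of SpatialPrompt: ask the
# VLM to ground its distance estimate in a visible reference object.
# The template text below is an assumption, not the paper's prompt.
def spatial_prompt(question: str) -> str:
    return (
        f"{question}\n"
        "Before answering, pick a reference object in the image whose "
        "real-world size you know (e.g., a door is about 2 m tall). "
        "Reason along a path from the reference object to the queried "
        "objects, comparing sizes and distances step by step, then give "
        "a numeric answer with units."
    )

print(spatial_prompt("How far apart are the chair and the bookshelf?"))
```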
arXiv Detail & Related papers (2024-09-15T16:45:42Z)
- HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments [93.94020724735199]
HAZARD consists of three unexpected disaster scenarios: fire, flood, and wind.
This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines.
arXiv Detail & Related papers (2024-01-23T18:59:43Z)
- Accelerating exploration and representation learning with offline pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
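The abstract does not spell out the objective; a standard noise-contrastive (InfoNCE-style) representation loss of the kind it names, contrasting the true future state against negatives drawn from the offline dataset, would take roughly this form (notation generic, not the paper's):

```latex
\mathcal{L}_{\mathrm{NCE}} = -\,\mathbb{E}\!\left[
  \log \frac{\exp\!\big(f_\theta(s_t)^{\top} g_\theta(s_{t+k})\big)}
            {\exp\!\big(f_\theta(s_t)^{\top} g_\theta(s_{t+k})\big)
             + \sum_{s^{-} \in \mathcal{N}} \exp\!\big(f_\theta(s_t)^{\top} g_\theta(s^{-})\big)}
\right]
```

Here $f_\theta$ and $g_\theta$ embed the current and future states, and $\mathcal{N}$ is a set of negative states sampled from the offline data.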
arXiv Detail & Related papers (2023-03-31T18:03:30Z)
- Batch Exploration with Examples for Scalable Robotic Reinforcement Learning [63.552788688544254]
Batch Exploration with Examples (BEE) explores relevant regions of the state-space guided by a modest number of human-provided images of important states.
BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot.
arXiv Detail & Related papers (2020-10-22T17:49:25Z)
- Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning [12.76337275628074]
In this work, we propose a variational dynamic model based on conditional variational inference to model the multimodality and stochasticity of environment transitions.
We derive an upper bound of the negative log-likelihood of the environmental transition and use such an upper bound as the intrinsic reward for exploration.
Our method outperforms several state-of-the-art environment model-based exploration approaches.
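For a conditional VAE dynamics model, the upper bound on the transition's negative log-likelihood mentioned above is typically the negative ELBO; in generic notation (the paper's exact form may differ), the per-step intrinsic reward would be:

```latex
r^{\mathrm{int}}_t \;=\;
\mathbb{E}_{q_\phi(z \mid s_t, a_t, s_{t+1})}\!\big[-\log p_\theta(s_{t+1} \mid z, s_t, a_t)\big]
\;+\; D_{\mathrm{KL}}\!\big(q_\phi(z \mid s_t, a_t, s_{t+1}) \,\|\, p_\psi(z \mid s_t, a_t)\big)
\;\ge\; -\log p(s_{t+1} \mid s_t, a_t)
```

Transitions the model reconstructs poorly receive a large bound and hence a large exploration bonus.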
arXiv Detail & Related papers (2020-10-17T09:54:51Z)
- Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal.
One of the challenges of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments.
We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)