EscapeBench: Pushing Language Models to Think Outside the Box
- URL: http://arxiv.org/abs/2412.13549v1
- Date: Wed, 18 Dec 2024 06:50:39 GMT
- Title: EscapeBench: Pushing Language Models to Think Outside the Box
- Authors: Cheng Qian, Peixuan Han, Qinyu Luo, Bingxiang He, Xiusi Chen, Yuji Zhang, Hongyi Du, Jiarui Yao, Xiaocheng Yang, Denghui Zhang, Yunzhu Li, Heng Ji
- Abstract summary: We introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning.
Our results show that current language models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints.
We propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks).
- Score: 49.44742596224033
- Abstract: Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current language models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across varying difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies. All data and code are released.
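The abstract describes two mechanisms: Foresight (hypothesizing unconventional uses for newly acquired tools) and Reflection (tracking puzzles that remain unsolved so they are revisited). The following is a minimal illustrative sketch of that loop; all class and method names are hypothetical and are not the paper's actual implementation or API.

```python
# Hypothetical sketch of a Foresight/Reflection agent loop.
# Names (EscapeAgentSketch, foresight, reflection) are illustrative only.

class EscapeAgentSketch:
    def __init__(self):
        self.inventory = []   # tools collected so far
        self.unsolved = []    # puzzles noticed but not yet solved (Reflection)

    def foresight(self, tool):
        """On acquiring a tool, pair it with every open puzzle as a
        candidate (possibly unconventional) use."""
        self.inventory.append(tool)
        return [(tool, puzzle) for puzzle in self.unsolved]

    def reflection(self, puzzle, solved):
        """Keep a running list of open puzzles so they are revisited."""
        if solved:
            if puzzle in self.unsolved:
                self.unsolved.remove(puzzle)
        elif puzzle not in self.unsolved:
            self.unsolved.append(puzzle)


agent = EscapeAgentSketch()
agent.reflection("locked drawer", solved=False)   # note an open puzzle
candidates = agent.foresight("hairpin")           # hairpin proposed as a pick
agent.reflection("locked drawer", solved=True)    # puzzle cleared from the list
```

The point of the sketch is only the division of labor: Foresight generates tool-puzzle pairings when the inventory changes, while Reflection keeps unfinished tasks visible so long action chains stay coherent.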
Related papers
- Tell Me What to Track: Infusing Robust Language Guidance for Enhanced Referring Multi-Object Tracking [10.614327633823462]
Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to localize an arbitrary number of targets.
We conduct a collaborative matching strategy to alleviate the impact of the imbalance, boosting the ability to detect newborn targets.
In the encoder, we integrate and enhance the cross-modal and multi-scale fusion, overcoming the bottlenecks in previous work.
arXiv Detail & Related papers (2024-12-17T05:43:35Z) - Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models [27.78471707423076]
We propose a new visual reasoning paradigm enabling MLLMs to autonomously modify the input scene to new ones based on its reasoning status.
We introduce a novel plug-and-play imagination space, where MLLMs conduct visual modifications through operations like focus, ignore, and transform.
We validate our approach through a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement.
arXiv Detail & Related papers (2024-11-27T08:44:25Z) - KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents [52.348929737851165]
Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges.
This inadequacy primarily stems from the lack of built-in action knowledge in language agents.
We introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge.
arXiv Detail & Related papers (2024-03-05T16:39:12Z) - Egocentric Planning for Scalable Embodied Task Achievement [6.870094263016224]
Egocentric Planning is an innovative approach that combines symbolic planning and Object-oriented POMDPs to solve tasks in complex environments.
We evaluated our approach in ALFRED, a simulated environment designed for domestic tasks, and demonstrated its high scalability.
Our method requires reliable perception and the specification or learning of a symbolic description of the preconditions and effects of the agent's actions.
arXiv Detail & Related papers (2023-06-02T06:41:24Z) - Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups.
Experiments demonstrate improved expected return on out-of-distribution goals, while still allowing goals with expressive structure to be specified.
arXiv Detail & Related papers (2022-11-01T03:31:43Z) - H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z) - Discovering and Achieving Goals via World Models [61.95437238374288]
We introduce Latent Explorer Achiever (LEXA), a unified solution to this problem.
LEXA learns a world model from image inputs and uses it to train an explorer and an achiever policy from imagined rollouts.
After the unsupervised phase, LEXA solves tasks specified as goal images zero-shot without any additional learning.
arXiv Detail & Related papers (2021-10-18T17:59:58Z) - Automatic Curriculum Learning through Value Disagreement [95.19299356298876]
Continually solving new, unsolved tasks is the key to learning diverse behaviors.
In the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can largely affect sample efficiency.
We propose setting up an automatic curriculum for goals that the agent needs to solve.
We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.
arXiv Detail & Related papers (2020-06-17T03:58:25Z)
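The automatic-curriculum entry above samples training goals where an agent's value estimates disagree. A minimal sketch of that sampling rule, assuming an ensemble of value functions whose spread marks the frontier of competence (the function names are illustrative, not the paper's code):

```python
import random
import statistics

def sample_goal(goals, value_ensemble, rng=random):
    """Sample a training goal with probability proportional to the
    disagreement (population std. dev.) of an ensemble of value
    estimates. Goals the ensemble agrees on are either mastered or
    hopeless; disagreement marks the frontier of competence."""
    disagreement = [
        statistics.pstdev(v(g) for v in value_ensemble) for g in goals
    ]
    total = sum(disagreement)
    if total == 0:                    # ensemble agrees everywhere
        return rng.choice(goals)
    r = rng.random() * total          # roulette-wheel selection
    acc = 0.0
    for g, d in zip(goals, disagreement):
        acc += d
        if r < acc:
            return g
    return goals[-1]


# Two value functions agree on goal "A" (mastered) but split on "B",
# so the curriculum concentrates training on "B".
ensemble = [
    lambda g: {"A": 1.0, "B": 0.0}[g],
    lambda g: {"A": 1.0, "B": 2.0}[g],
]
goal = sample_goal(["A", "B"], ensemble)
```

Roulette-wheel sampling (rather than always picking the argmax) keeps some coverage of mid-disagreement goals, which matters once several tasks sit on the frontier at once.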
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.