MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
- URL: http://arxiv.org/abs/2507.20395v1
- Date: Sun, 27 Jul 2025 19:33:45 GMT
- Title: MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
- Authors: Hafsteinn Einarsson
- Abstract summary: This paper introduces MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in Large Language Models. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities.
- Score: 0.0679877553227375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity ($5\times 5$ to $15\times 15$ grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI's O3 achieves perfect navigation for mazes up to size $30\times 30$, other models exhibit catastrophic failure beyond $9\times 9$ mazes, with 100% of failures attributed to excessive looping behavior where models revisit a cell at least 10 times. We document a significant performance degradation in Icelandic, with models solving mazes 3-4 sizes smaller than in English, suggesting spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms. These results have important implications for global deployment of LLM-powered autonomous systems, showing spatial intelligence remains fundamentally constrained by training data availability and highlighting the need for architectural innovations to achieve reliable navigation across linguistic contexts.
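The abstract describes a function-calling interface in which the model receives only its current coordinates and distance-to-wall feedback. As a rough illustration of what such an interface could look like, here is a minimal sketch; the class, the observation fields, and the move encoding are assumptions for illustration, not the benchmark's actual API.
```python
# Minimal sketch of a coordinate-feedback maze interface in the spirit of
# MazeEval. Class names, observation fields, and the move encoding are
# illustrative assumptions, not the benchmark's actual API.
from dataclasses import dataclass

# Row 0 is the top of the grid, so "north" decreases y.
MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}

@dataclass
class Observation:
    x: int
    y: int
    walls: dict  # distance (in open cells) to the nearest wall per direction

class MazeEnv:
    def __init__(self, grid, start, goal):
        self.grid = grid          # grid[y][x] == 1 means wall, 0 means open
        self.x, self.y = start
        self.goal = goal

    def _distance_to_wall(self, dx, dy):
        x, y, steps = self.x, self.y, 0
        while self.grid[y + dy][x + dx] == 0:  # border walls guarantee termination
            x, y, steps = x + dx, y + dy, steps + 1
        return steps

    def observe(self):
        walls = {name: self._distance_to_wall(dx, dy) for name, (dx, dy) in MOVES.items()}
        return Observation(self.x, self.y, walls)

    def move(self, direction):
        """The single tool exposed to the model; returns the new observation."""
        dx, dy = MOVES[direction]
        if self.grid[self.y + dy][self.x + dx] == 0:  # moves into walls are no-ops
            self.x, self.y = self.x + dx, self.y + dy
        return self.observe()

# Usage: a 5x5 grid whose outer ring is all walls.
grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
env = MazeEnv(grid, start=(1, 1), goal=(3, 3))
print(env.move("east"))  # Observation(x=2, y=1, walls={'north': 0, ...})
```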
Related papers
- OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding [53.33067495235966]
OnlineSI is a framework that improves a model's spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations. We further integrate 3D point cloud information with semantic information, helping the MLLM to better locate and identify objects in the scene.
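As a rough illustration of the "finite spatial memory" idea, here is a minimal bounded-buffer sketch; the class and field names are assumptions, and OnlineSI's actual memory design is not reproduced here.
```python
# Generic bounded spatial memory: keep the most recent N observations and let
# the oldest fall out. A sketch of the idea only, not OnlineSI's actual design.
from collections import deque

class SpatialMemory:
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted automatically

    def add(self, frame_features, points_3d):
        # Store whatever per-frame evidence the MLLM will later be grounded on.
        self.buffer.append({"features": frame_features, "points": points_3d})

    def recall(self):
        # Hand the retained observations back as context for the next query.
        return list(self.buffer)

memory = SpatialMemory(capacity=32)
memory.add(frame_features=[0.1, 0.2], points_3d=[(1.0, 2.0, 0.5)])
print(len(memory.recall()))  # 1
```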
arXiv Detail & Related papers (2026-01-23T08:17:57Z) - From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z) - GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation [40.999498663740454]
Multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks. GroundingME systematically challenges models across four critical dimensions. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap.
arXiv Detail & Related papers (2025-12-19T12:06:25Z) - Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation [59.40886078302025]
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood. We present a lightweight black-box framework for explaining autoregressive token generation in MLLMs.
arXiv Detail & Related papers (2025-09-26T15:38:42Z) - Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports [10.925743866700037]
We study the inferential abilities of large vision-language models on texts related to amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.
arXiv Detail & Related papers (2025-07-08T09:06:47Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - From Text to Space: Mapping Abstract Spatial Models in LLMs during a Grid-World Navigation Task [0.0]
We investigate the influence of different text-based spatial representations on the performance and internal activations of large language models (LLMs) in a grid-world navigation task. Our experiments reveal that Cartesian representations of space consistently yield higher success rates and path efficiency, with performance scaling markedly with model size. This work advances our understanding of how LLMs process spatial information and provides valuable insights for developing more interpretable and robust agentic AI systems.
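The finding that Cartesian encodings work best suggests a concrete prompt format. Below is a hypothetical sketch of serializing a grid world in Cartesian coordinates; the exact textual formats compared in the paper are not reproduced here.
```python
# Hypothetical Cartesian serialization of a grid world for an LLM prompt.
# The concrete textual formats the paper compares are not reproduced here.
def to_cartesian_prompt(size, agent, goal, walls):
    lines = [
        f"The grid is {size}x{size}. The origin (0, 0) is the bottom-left corner.",
        f"You are at {agent}. The goal is at {goal}.",
        "Walls occupy: " + ", ".join(str(w) for w in sorted(walls)) + ".",
    ]
    return "\n".join(lines)

print(to_cartesian_prompt(size=4, agent=(0, 0), goal=(3, 3), walls={(1, 1), (2, 1)}))
```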
arXiv Detail & Related papers (2025-02-23T19:09:01Z) - Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation [35.71602601385161]
We present a novel vision-language model (VLM)-based navigation framework. Our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks.
arXiv Detail & Related papers (2025-02-20T04:41:40Z) - Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback [52.763620660061115]
ONI is a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function. We explore a range of algorithmic choices for reward modeling with varying complexity. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment.
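Learned intrinsic rewards typically enter the policy update additively alongside the environment reward. The following is a generic schematic of that combination, not ONI's actual implementation; reward_model and the mixing coefficient are placeholders.
```python
# Schematic additive combination of environment and learned intrinsic rewards,
# as is common in intrinsic-motivation RL. Not ONI's actual code; reward_model
# and the mixing coefficient are placeholders.
def shaped_reward(extrinsic, observation, reward_model, coeff=0.1):
    intrinsic = reward_model(observation)  # learned online, e.g. from LLM feedback
    return extrinsic + coeff * intrinsic

# Toy stand-in for the learned reward model.
reward_model = lambda obs: 1.0 if obs.get("novel") else 0.0
print(shaped_reward(extrinsic=1.0, observation={"novel": True}, reward_model=reward_model))
# 1.1
```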
arXiv Detail & Related papers (2024-10-30T13:52:43Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple, lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
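As an illustration of the marking idea, the sketch below draws a shared ID on tracked object positions across frames so an MLLM can relate the same object between viewpoints; the tracks format and the drawing details are assumptions, not the paper's implementation.
```python
# Sketch of marking tracked objects with shared IDs across frames. Uses Pillow
# for drawing; the tracks format and styling are illustrative assumptions.
from PIL import Image, ImageDraw

def mark_correspondences(frames, tracks):
    """tracks: {object_id: [(frame_idx, (x, y)), ...]} from any lightweight tracker."""
    marked = [frame.copy() for frame in frames]
    for obj_id, positions in tracks.items():
        for frame_idx, (x, y) in positions:
            draw = ImageDraw.Draw(marked[frame_idx])
            draw.ellipse((x - 12, y - 12, x + 12, y + 12), outline="red", width=3)
            draw.text((x - 4, y - 6), str(obj_id), fill="red")
    return marked

frames = [Image.new("RGB", (128, 128), "white") for _ in range(2)]
marked = mark_correspondences(frames, {1: [(0, (40, 40)), (1, (60, 50))]})
```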
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z) - Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [31.509994889286183]
We introduce Language Agent Tree Search (LATS) -- the first general framework that synergizes the capabilities of language models (LMs) in reasoning, acting, and planning.
A key feature of our approach is the incorporation of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism.
LATS achieves state-of-the-art pass@1 accuracy (92.7%) for programming on HumanEval with GPT-4 and demonstrates gradient-free performance (average score of 75.9) comparable to gradient-based fine-tuning for web navigation on WebShop with GPT
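LATS frames problem-solving as tree search guided by external feedback. The sketch below is a heavily simplified best-first variant of that loop (the actual method uses Monte Carlo tree search with LM-generated reflections); propose_actions, step, and is_solved are placeholder hooks, not the paper's API.
```python
# Heavily simplified best-first search in the spirit of LATS: expand candidate
# actions proposed by an LM, score successors with environment feedback, and
# always expand the most promising node next. The hooks are placeholders.
import heapq

def tree_search(initial_state, propose_actions, step, is_solved, budget=50):
    counter = 0                                 # tie-breaker so states never compare
    frontier = [(0.0, counter, initial_state)]  # min-heap on negated value
    while frontier and budget > 0:
        _, _, state = heapq.heappop(frontier)
        for action in propose_actions(state):        # LM samples candidate actions
            next_state, value = step(state, action)  # environment gives feedback
            if is_solved(next_state):
                return next_state
            counter += 1
            budget -= 1
            heapq.heappush(frontier, (-value, counter, next_state))
    return None  # search budget exhausted without a solution
```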
arXiv Detail & Related papers (2023-10-06T17:55:11Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector.
Our work offers deeper insight into the LiDAR-based grounding task, and we expect it to present a promising direction for the autonomous driving community.
arXiv Detail & Related papers (2023-05-25T06:22:10Z)