Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments
- URL: http://arxiv.org/abs/2512.24504v2
- Date: Thu, 01 Jan 2026 21:35:11 GMT
- Title: Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments
- Authors: Zhiwei Wei, Yuxing Liu, Hua Liao, Wenjia Xu
- Abstract summary: Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents understand and act in such environments is critical for enabling reliable map-based reasoning and applications. We propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents understand and act in such environments is therefore critical for enabling reliable map-based reasoning and applications. However, most existing evaluations of spatial ability in FMs rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding. In this paper, we propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments. Agents incrementally explore partially observable grid-based maps consisting of roads, intersections, and points of interest (POIs), receiving only local observations at each step. Spatial understanding is then evaluated using six kinds of spatial tasks. By systematically varying exploration strategies, memory representations, and reasoning schemes across multiple foundation models, we reveal distinct functional roles of these components. Exploration primarily affects experience acquisition but has a limited impact on final reasoning accuracy. In contrast, memory representation plays a central role in consolidating spatial experience, with structured memories, particularly sequential and graph-based representations, substantially improving performance on structure-intensive tasks such as path planning. Reasoning schemes further shape how stored spatial knowledge is used, with advanced prompts supporting more effective multi-step inference. We further observe that spatial reasoning performance saturates across model versions and scales beyond a certain capability threshold, indicating that improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone.
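The interactive setup the abstract describes, an agent receiving only local observations of a symbolic grid map, can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's code; the class and method names (`GridMapEnv`, `observe`, `step`) and the cell symbols are assumptions.

```python
# Hypothetical sketch of a partially observable symbolic grid-map
# environment: roads ('R'), intersections ('I'), POIs ('P'),
# non-traversable cells ('.').
from dataclasses import dataclass

GRID = [
    "RRIRR",
    "..R..",
    "P.R.P",
    "..R..",
    "RRIRR",
]

@dataclass
class GridMapEnv:
    grid: list           # symbolic map as a list of row strings
    pos: tuple = (2, 2)  # agent's (row, col) position

    def observe(self, radius: int = 1):
        """Return the local (2*radius+1)^2 window around the agent;
        cells outside the map are reported as '#'."""
        r0, c0 = self.pos
        window = []
        for r in range(r0 - radius, r0 + radius + 1):
            row = ""
            for c in range(c0 - radius, c0 + radius + 1):
                if 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0]):
                    row += self.grid[r][c]
                else:
                    row += "#"
            window.append(row)
        return window

    def step(self, move: tuple):
        """Move by (dr, dc) if the target cell is traversable,
        then return the new local observation."""
        r, c = self.pos[0] + move[0], self.pos[1] + move[1]
        if 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0]) \
                and self.grid[r][c] != ".":
            self.pos = (r, c)
        return self.observe()

env = GridMapEnv(GRID)
print(env.observe())  # local 3x3 view around the start cell
```

An FM agent would receive each local window as text, decide on a move, and accumulate the windows into whatever memory representation (sequential, graph-based, etc.) is under study.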
Related papers
- OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents [68.85365034738534]
We introduce a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split.
arXiv Detail & Related papers (2026-02-19T18:59:54Z)
- MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps [22.530685223300523]
MapVerse is a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps.
arXiv Detail & Related papers (2026-02-11T04:36:14Z)
- Observing Health Outcomes Using Remote Sensing Imagery and Geo-Context Guided Visual Transformer [8.825339734603862]
We propose a novel model that enhances remote sensing imagery processing with guidance from auxiliary geospatial information. Our approach introduces a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches that are spatially aligned with image patches. We show that the proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence.
arXiv Detail & Related papers (2026-01-26T22:45:28Z)
- Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning [55.251697395358285]
Large language models (LLMs) are increasingly deployed as intelligent agents that reason, plan, and interact with their environments. To effectively scale to long-horizon scenarios, a key capability for such agents is a memory mechanism that can retain, organize, and retrieve past experiences. We propose CompassMem, an event-centric memory framework inspired by Event Theory.
arXiv Detail & Related papers (2026-01-08T08:44:07Z)
- Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
Spatial reasoning advances vision-language models from visual perception toward semantic understanding. We integrate the cognitive concept of an object-centric blueprint into spatial reasoning. Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z)
- Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.15756345836901]
We provide a comprehensive review of multimodal spatial reasoning tasks with large models. We review advances in embodied AI, including vision-language navigation and action models. We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z)
- EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer [5.855255212938064]
We introduce two dynamic spatial benchmarks that evaluate models' abilities in spatial understanding and adaptive planning. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory.
arXiv Detail & Related papers (2025-09-16T06:21:38Z)
- Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation [50.81551581148339]
We introduce Relevant Reasoning (R$^2$S), a reasoning-based segmentation framework. We also introduce 3D ReasonSeg, a reasoning-based segmentation dataset. Experiments demonstrate that R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities.
arXiv Detail & Related papers (2025-06-29T06:58:08Z)
- FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment [16.987872206495897]
FindAnything is an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs.
arXiv Detail & Related papers (2025-04-11T15:12:05Z)
- From Text to Space: Mapping Abstract Spatial Models in LLMs during a Grid-World Navigation Task [0.0]
We investigate the influence of different text-based spatial representations on the performance and internal activations of large language models (LLMs) in a grid-world navigation task. Our experiments reveal that Cartesian representations of space consistently yield higher success rates and path efficiency, with performance scaling markedly with model size. This work advances our understanding of how LLMs process spatial information and provides valuable insights for developing more interpretable and robust agentic AI systems.
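The contrast this abstract draws between text-based spatial representations can be illustrated with a small sketch. The two encoder functions below are hypothetical illustrations of the general idea, not the paper's actual prompt formats: one states explicit Cartesian coordinates, the other describes the goal purely relative to the agent.

```python
# Illustrative (hypothetical) text encodings of the same grid-world state.
def cartesian_encoding(agent, goal, size):
    """Explicit Cartesian encoding: absolute (x, y) coordinates."""
    return (f"Grid is {size}x{size}. You are at (x={agent[0]}, y={agent[1]}). "
            f"The goal is at (x={goal[0]}, y={goal[1]}).")

def relative_encoding(agent, goal):
    """Relative encoding: goal described only by offset from the agent."""
    dx, dy = goal[0] - agent[0], goal[1] - agent[1]
    ew = f"{abs(dx)} step(s) {'east' if dx > 0 else 'west'}" if dx else ""
    ns = f"{abs(dy)} step(s) {'north' if dy > 0 else 'south'}" if dy else ""
    parts = [p for p in (ew, ns) if p]
    return "The goal is " + " and ".join(parts) + " of you."

print(cartesian_encoding((1, 2), (4, 0), 5))
print(relative_encoding((1, 2), (4, 0)))
```

Feeding the same underlying state through different encoders like these is one way such studies isolate the effect of the representation from the task itself.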
arXiv Detail & Related papers (2025-02-23T19:09:01Z)
- SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z)
- Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation [1.2473780585666772]
Most Vision-and-Language Navigation (VLN) algorithms are prone to making inaccurate decisions due to their lack of visual common sense and limited reasoning capabilities.
We propose a Hierarchical Spatial Proximity Reasoning (HSPR) method to help the agent build a knowledge base of hierarchical spatial proximity.
We validate our approach with experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R.
arXiv Detail & Related papers (2024-03-18T07:51:22Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
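The projection-and-accumulation step this summary describes can be sketched in a few lines. This is a hedged reconstruction of the general idea only; the function names, the translation-only frame transform, and the score format are all assumptions, not the paper's method.

```python
# Hypothetical sketch: egocentric region-label estimates are projected
# into the global frame and accumulated as a per-cell distribution.
from collections import defaultdict

def to_global(agent_pos, local_offset):
    """Project an egocentric cell offset into global map coordinates
    (simplified here to a pure translation, ignoring agent heading)."""
    return (agent_pos[0] + local_offset[0], agent_pos[1] + local_offset[1])

def update_semantic_map(sem_map, agent_pos, observation):
    """observation: {local_offset: {label: score}}, e.g. from a
    vision-to-language model; scores accumulate per global cell."""
    for offset, label_scores in observation.items():
        cell = to_global(agent_pos, offset)
        for label, score in label_scores.items():
            sem_map[cell][label] += score

def label_distribution(sem_map, cell):
    """Normalize accumulated scores into a distribution over labels."""
    total = sum(sem_map[cell].values())
    return {lab: s / total for lab, s in sem_map[cell].items()}

sem_map = defaultdict(lambda: defaultdict(float))
# Two observations of the same global cell from different agent poses:
update_semantic_map(sem_map, (3, 3), {(0, 1): {"kitchen": 0.9, "hallway": 0.1}})
update_semantic_map(sem_map, (2, 4), {(1, 0): {"kitchen": 0.6, "hallway": 0.4}})
print(label_distribution(sem_map, (3, 4)))
```

Repeated observations of the same location from different poses reinforce or temper each cell's label distribution, which is what makes the map a distribution over region labels rather than a hard assignment.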
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
- Spatial Pyramid Based Graph Reasoning for Semantic Segmentation [67.47159595239798]
We apply graph convolution into the semantic segmentation task and propose an improved Laplacian.
The graph reasoning is directly performed in the original feature space organized as a spatial pyramid.
We achieve comparable performance with advantages in computational and memory overhead.
arXiv Detail & Related papers (2020-03-23T12:28:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.