Can Large Vision Language Models Read Maps Like a Human?
- URL: http://arxiv.org/abs/2503.14607v1
- Date: Tue, 18 Mar 2025 18:05:38 GMT
- Title: Can Large Vision Language Models Read Maps Like a Human?
- Authors: Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, Zhengzhong Tu,
- Abstract summary: MapBench comprises over 1600 pixel-space map path finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework.
- Score: 16.81757312518894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based outdoor map navigation, curated from complex path finding scenarios. MapBench comprises over 1600 pixel-space map path finding problems drawn from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure for converting between natural language and map space and for evaluating LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all code and the dataset at https://github.com/taco-group/MapBench.
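The MSSG is described here only at a high level. As a rough sketch of how such a landmark-graph index could be used to check an LVLM-generated route against the map, consider the toy example below; the class and method names are illustrative assumptions, not the released MapBench API.

```python
from dataclasses import dataclass, field

@dataclass
class MSSG:
    """Toy Map Space Scene Graph: landmarks as nodes, walkable segments as edges."""
    edges: dict = field(default_factory=dict)  # landmark -> set of adjacent landmarks

    def add_path(self, a: str, b: str) -> None:
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def is_valid_route(self, route: list) -> bool:
        """Check that consecutive landmarks in a generated route are connected on the map."""
        return all(b in self.edges.get(a, set()) for a, b in zip(route, route[1:]))

# Example: grounding an instruction such as
# "From the fountain, walk past the library to the main gate."
graph = MSSG()
graph.add_path("fountain", "library")
graph.add_path("library", "main gate")

print(graph.is_valid_route(["fountain", "library", "main gate"]))  # True: every hop exists
print(graph.is_valid_route(["fountain", "main gate"]))             # False: no direct segment
```

The actual MSSG also handles the conversion between natural language and map space mentioned in the abstract; the sketch only illustrates the evaluation side, i.e. checking that consecutive landmarks in a generated route are actually connected.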
Related papers
- Control Map Distribution using Map Query Bank for Online Map Generation [18.325267388089696]
Reliable autonomous driving systems require high-definition (HD) maps for planning and navigation.
Online map generation (OMG) has become an alternative low-cost solution for building a local HD map.
OMG learns HD map predictions from an initial distribution of map queries.
It is important to keep point-level information in map queries when interacting with the BEV feature map.
arXiv Detail & Related papers (2025-04-04T18:47:42Z) - TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [52.422619828854984]
We introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information.
To fully unlock the MLLM's spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
arXiv Detail & Related papers (2024-11-25T14:27:55Z) - TopoSD: Topology-Enhanced Lane Segment Perception with SDMap Prior [70.84644266024571]
We propose to train a perception model to "see" standard definition maps (SDMaps).
We encode SDMap elements into neural spatial map representations and instance tokens, and then incorporate such complementary features as prior information.
Based on the lane segment representation framework, the model simultaneously predicts lanes, centrelines and their topology.
arXiv Detail & Related papers (2024-11-22T06:13:42Z) - Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models [15.454856838083511]
Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common sense reasoning.
Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps.
We propose an explicit text-based map that can represent thousands of semantic classes while easily integrating with LLMs.
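As a loose illustration of what an explicit text-based map could look like, here is a minimal sketch that stores open-vocabulary tags against the places where they were observed and serializes them into plain text for an LLM prompt; the names and structure are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict

class TagMap:
    """Toy text-based map: open-vocabulary tags -> places where they were observed."""

    def __init__(self):
        self.tag_to_places = defaultdict(set)

    def add_observation(self, tag: str, place_id: str) -> None:
        self.tag_to_places[tag.lower()].add(place_id)

    def to_prompt(self) -> str:
        """Serialize the map as plain text so it can be embedded in an LLM prompt."""
        lines = [f"{tag}: {', '.join(sorted(places))}"
                 for tag, places in sorted(self.tag_to_places.items())]
        return "\n".join(lines)

tag_map = TagMap()
tag_map.add_observation("coffee machine", "kitchen_viewpoint_3")
tag_map.add_observation("whiteboard", "meeting_room_viewpoint_1")
print(tag_map.to_prompt())
```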
arXiv Detail & Related papers (2024-09-23T18:26:19Z) - Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language [51.805056586678184]
We present a Language-enhanced Renderable Neural Radiance map for Visual Navigation with natural language query prompts.
Le-RNR-Map employs a grid structure comprising latent codes positioned at each pixel.
We enhance RNR-Map with CLIP-based embedding latent codes, allowing natural language search without additional label data.
arXiv Detail & Related papers (2023-08-17T08:27:01Z) - Grid Cell-Inspired Fragmentation and Recall for Efficient Map Building [29.630483662400444]
We propose and apply the concept of Fragmentation-and-Recall (FARMap) in the mapping of large spaces.
Agents solve the mapping problem by building local maps via a surprisal-based clustering of space.
We demonstrate that FARMap replicates the fragmentation points observed in animal studies.
arXiv Detail & Related papers (2023-07-11T20:40:19Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatial-aware, map-based pre-training paradigm for vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z) - Visual Language Maps for Robot Navigation [30.33041779258644]
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data.
We propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world.
arXiv Detail & Related papers (2022-10-11T18:13:20Z) - Long-term Visual Map Sparsification with Heterogeneous GNN [47.12309045366042]
In this paper, we aim to overcome the environmental changes and reduce the map size at the same time by selecting points that are valuable to future localization.
Inspired by recent progress in Graph Neural Networks (GNNs), we propose the first work that models SfM maps as heterogeneous graphs and predicts 3D point importance scores with a GNN.
Two novel supervision terms are proposed: 1) a data-fitting term that selects points valuable for future localization based on training queries; 2) a K-Cover term that selects sparse points with full map coverage.
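As a rough, hedged illustration of the coverage idea behind a K-Cover style objective (not the paper's actual GNN supervision, which is a differentiable training term), a greedy max-coverage selection of map points over the frames that observe them could look like this:

```python
def greedy_k_cover(point_visibility: dict, k: int) -> set:
    """Pick up to k map points greedily so that they cover as many frames as possible.

    point_visibility maps a point id to the set of frames that observe it.
    This is the classic greedy max-coverage heuristic, shown only to illustrate
    the coverage objective.
    """
    selected, covered = set(), set()
    for _ in range(k):
        best = max(
            (p for p in point_visibility if p not in selected),
            key=lambda p: len(point_visibility[p] - covered),
            default=None,
        )
        if best is None or not (point_visibility[best] - covered):
            break  # nothing left to gain
        selected.add(best)
        covered |= point_visibility[best]
    return selected

# Example: 3D points seen from different camera frames
visibility = {"p1": {"f1", "f2"}, "p2": {"f2", "f3"}, "p3": {"f4"}}
print(greedy_k_cover(visibility, k=2))  # {'p1', 'p2'}: covers frames f1, f2, f3
```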
arXiv Detail & Related papers (2022-03-29T01:46:12Z) - An Automatic Approach for Generating Rich, Linked Geo-Metadata from Historical Map Images [6.962949867017594]
This paper presents an end-to-end approach to address the real-world problem of finding and indexing historical map images.
We have implemented the approach in a system called mapKurator.
arXiv Detail & Related papers (2021-12-03T01:44:38Z) - Rethinking Localization Map: Towards Accurate Object Perception with Self-Enhancement Maps [78.2581910688094]
This work introduces a novel self-enhancement method to harvest accurate object localization maps and object boundaries with only category labels as supervision.
In particular, the proposed Self-Enhancement Maps achieve the state-of-the-art localization accuracy of 54.88% on ILSVRC.
arXiv Detail & Related papers (2020-06-09T12:35:55Z)