IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
- URL: http://arxiv.org/abs/2507.06993v3
- Date: Tue, 23 Sep 2025 03:59:02 GMT
- Title: IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
- Authors: Jieren Deng, Zhizhang Hu, Ziyan He, Aleksandar Cvetkovic, Pak Kiu Chung, Dragomir Yankov, Chiqun Zhang,
- Abstract summary: We introduce IMAIA, an interactive Maps AI Assistant. It enables natural-language interaction with both vector (street) maps and satellite imagery. It augments camera inputs with geospatial intelligence to help users understand the world.
- Score: 36.703562827382655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Map applications are still largely point-and-click, making it difficult to ask map-centric questions or connect what a camera sees to the surrounding geospatial context with view-conditioned inputs. We introduce IMAIA, an interactive Maps AI Assistant that enables natural-language interaction with both vector (street) maps and satellite imagery, and augments camera inputs with geospatial intelligence to help users understand the world. IMAIA comprises two complementary components. Maps Plus treats the map as first-class context by parsing tiled vector/satellite views into a grid-aligned representation that a language model can query to resolve deictic references (e.g., ``the flower-shaped building next to the park in the top-right''). Places AI Smart Assistant (PAISA) performs camera-aware place understanding by fusing image--place embeddings with geospatial signals (location, heading, proximity) to ground a scene, surface salient attributes, and generate concise explanations. A lightweight multi-agent design keeps latency low and exposes interpretable intermediate decisions. Across map-centric QA and camera-to-place grounding tasks, IMAIA improves accuracy and responsiveness over strong baselines while remaining practical for user-facing deployments. By unifying language, maps, and geospatial cues, IMAIA moves beyond scripted tools toward conversational mapping that is both spatially grounded and broadly usable.
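The grid-aligned representation at the heart of Maps Plus can be made concrete with a small sketch. The Python below is a minimal illustration under our own assumptions (the `MapFeature` type, the 3x3 cell labels, and the serialization format are hypothetical, not IMAIA's implementation): it bins named features from the visible tile into labeled grid cells and renders them as text, which is the kind of context that lets a language model resolve a deictic phrase like "the flower-shaped building next to the park in the top-right".

```python
# Minimal sketch of a grid-aligned map context in the spirit of Maps Plus.
# All names (MapFeature, grid_context, ...) are hypothetical, not IMAIA's API.
from dataclasses import dataclass

@dataclass
class MapFeature:
    name: str        # e.g. "Lotus Tower" or "Central Park"
    kind: str        # e.g. "building", "park"
    x: float         # pixel coordinates inside the rendered tile
    y: float

CELL_LABELS = [
    ["top-left", "top-center", "top-right"],
    ["middle-left", "center", "middle-right"],
    ["bottom-left", "bottom-center", "bottom-right"],
]

def grid_context(features, width, height):
    """Bin visible map features into a 3x3 grid and serialize them for an LLM prompt."""
    cells = {}
    for f in features:
        col = min(int(3 * f.x / width), 2)
        row = min(int(3 * f.y / height), 2)
        cells.setdefault(CELL_LABELS[row][col], []).append(f)
    lines = []
    for label, feats in cells.items():
        items = ", ".join(f"{f.name} ({f.kind})" for f in feats)
        lines.append(f"{label}: {items}")
    return "\n".join(lines)

if __name__ == "__main__":
    view = [
        MapFeature("Lotus Tower", "flower-shaped building", 720, 90),
        MapFeature("Central Park", "park", 610, 120),
        MapFeature("City Hall", "building", 400, 450),
    ]
    # The serialized grid becomes extra context for a question such as:
    # "Which feature is the flower-shaped building next to the park in the top-right?"
    print(grid_context(view, width=800, height=600))
```

PAISA's camera-aware grounding would add image-place embeddings and signals such as location, heading, and proximity on top of this textual map context; that part is omitted from the sketch.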
Related papers
- CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding [5.925837407110905]
We introduce CartoMapQA, a benchmark to evaluate Vision-Language Models' understanding of cartographic maps.
The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer.
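Taken at face value, each sample pairs one map image with a question and a ground-truth answer; a hypothetical record type for such a sample could look like the sketch below (field names and example values are assumptions, not the dataset's actual schema).

```python
# Hypothetical shape of one CartoMapQA-style sample; field names are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MapQASample:
    map_image: str                 # path or URL of the cartographic map
    question: str                  # open-ended or multiple-choice question
    choices: Optional[list[str]]   # present only for multiple-choice items
    answer: str                    # ground-truth answer

sample = MapQASample(
    map_image="maps/example_city.png",
    question="Which cardinal direction is the river flowing?",
    choices=["north", "south", "east", "west"],
    answer="east",
)
```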
arXiv Detail & Related papers (2025-12-03T08:25:22Z)
- DescribeEarth: Describe Anything for Remote Sensing Images [56.04533626223295]
We propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing.
To support this task, we construct DE-Dataset, a large-scale dataset with detailed descriptions of object attributes, relationships, and contexts.
We also present DescribeEarth, a Multi-modal Large Language Model architecture explicitly designed for Geo-DLC.
arXiv Detail & Related papers (2025-09-30T01:53:34Z)
- Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM [58.50687282180444]
Travel planning is a complex task requiring the integration of diverse real-world information and user preferences.
We formulate this as an $L^3$ planning problem, emphasizing long context, long instruction, and long output.
We introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems.
arXiv Detail & Related papers (2025-06-14T09:37:59Z)
- EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation [50.433911327489554]
We introduce EarthMapper, a novel framework for controllable satellite-map translation.
We also contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities.
Experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance.
arXiv Detail & Related papers (2025-04-28T02:41:12Z)
- TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning [39.934634038758404]
This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning.
Our dataset includes 2,348 real-world travel queries, 85,575 fine-grained POIs, and 18,784 annotated POIs.
arXiv Detail & Related papers (2025-04-11T17:02:40Z)
- FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment [16.987872206495897]
FindAnything is an open-world mapping framework that incorporates vision-language information into dense volumetric submaps.
Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs.
arXiv Detail & Related papers (2025-04-11T15:12:05Z)
- PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [64.58959634712215]
Geologic maps, as fundamental diagrams in geology, provide critical insights into the structure and composition of Earth's subsurface and surface.
Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding.
To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding.
arXiv Detail & Related papers (2025-01-10T18:59:42Z)
- Semantic Mapping in Indoor Embodied AI -- A Survey on Advances, Challenges, and Future Directions [10.655606296339055]
A semantic map captures information about the environment in a structured way, allowing the agent to reference it for advanced reasoning.
This paper provides a review of semantic map-building approaches in embodied AI, specifically for indoor navigation.
We identify that the field is moving towards developing open-vocabulary, queryable, task-agnostic map representations.
arXiv Detail & Related papers (2025-01-10T06:58:14Z)
- MapExplorer: New Content Generation from Low-Dimensional Visualizations [60.02149343347818]
Low-dimensional visualizations, or "projection maps," are widely used to interpret large-scale and complex datasets.
These visualizations not only aid in understanding existing knowledge spaces but also implicitly guide exploration into unknown areas.
We introduce MapExplorer, a novel knowledge discovery task that translates coordinates within any projection map into coherent, contextually aligned textual content.
arXiv Detail & Related papers (2024-12-24T20:16:13Z)
- TravelAgent: An AI Assistant for Personalized Travel Planning [36.046107116324826]
We introduce TravelAgent, a travel planning system powered by large language models (LLMs).
TravelAgent comprises four modules: Tool-usage, Recommendation, Planning, and Memory Module.
We evaluate TravelAgent's performance with human and simulated users, demonstrating its overall effectiveness across three criteria and confirming the accuracy of its personalized recommendations.
arXiv Detail & Related papers (2024-09-12T14:24:45Z)
- Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions [19.03156236107806]
This paper introduces a novel agentic workflow characterized by its abilities to perceive, reflect, and plan.
We find LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation.
arXiv Detail & Related papers (2024-08-08T02:28:43Z)
- Evaluating Tool-Augmented Agents in Remote Sensing Platforms [1.8434042562191815]
Existing benchmarks assume question-answering input templates over predefined image-text data pairs.
We present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform.
arXiv Detail & Related papers (2024-04-23T20:37:24Z)
- MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation [73.81268591484198]
Embodied agents equipped with GPT have exhibited extraordinary decision-making and generalization abilities across various tasks.
We present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online linguistic-formed map to encourage global exploration.
Benefiting from this design, we propose an adaptive planning mechanism to assist the agent in performing multi-step path planning based on a map.
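The "online linguistic-formed map" can be pictured with a short sketch: a topological map of observed viewpoints is rendered as text so a GPT-based agent can plan over it. The node/edge structure and prompt wording below are our own assumptions for illustration, not MapGPT's actual prompt format.

```python
# Sketch of an online "linguistic-formed map" in the spirit of MapGPT.
# The node/edge structure and the prompt wording are illustrative assumptions.

def map_to_text(nodes, edges, current, visited):
    """Render a topological map as text that an LLM agent can plan over.

    nodes:   {node_id: short description from observations}
    edges:   set of (a, b) pairs of connected node ids
    current: node id where the agent stands
    visited: set of node ids already explored
    """
    lines = [f"Current position: node {current} ({nodes[current]})."]
    lines.append("Map so far:")
    for a, b in sorted(edges):
        lines.append(f"  node {a} ({nodes[a]}) <-> node {b} ({nodes[b]})")
    frontier = [n for n in nodes if n not in visited]
    if frontier:
        desc = ", ".join(f"node {n} ({nodes[n]})" for n in frontier)
        lines.append(f"Unexplored candidates: {desc}.")
    lines.append("Plan the next nodes to visit to follow the instruction.")
    return "\n".join(lines)

# Example: three observed viewpoints, one not yet visited.
nodes = {0: "lobby", 1: "hallway with paintings", 2: "stairs going up"}
edges = {(0, 1), (1, 2)}
print(map_to_text(nodes, edges, current=1, visited={0, 1}))
```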
arXiv Detail & Related papers (2024-01-14T15:34:48Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
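As a rough illustration of the multi-granularity idea, the sketch below keeps a coarse semantic class alongside fine-grained attributes (color, texture) for each map cell; the data structure and field names are our own assumptions, not the paper's implementation.

```python
# Sketch of a multi-granularity map cell: coarse semantic class plus
# fine-grained object attributes. Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class MapCell:
    semantic_class: str = "unknown"                  # coarse label, e.g. "chair"
    attributes: dict = field(default_factory=dict)   # fine details, e.g. color/texture

# A tiny 2x2 top-down map indexed by grid coordinates.
grid = {(r, c): MapCell() for r in range(2) for c in range(2)}
grid[(0, 1)] = MapCell("chair", {"color": "red", "texture": "fabric"})
grid[(1, 0)] = MapCell("table", {"color": "brown", "texture": "wood"})

# An instruction like "go to the red fabric chair" can then be grounded by
# matching both granularities: the class and the fine-grained attributes.
matches = [
    pos for pos, cell in grid.items()
    if cell.semantic_class == "chair" and cell.attributes.get("color") == "red"
]
print(matches)  # [(0, 1)]
```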
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints [94.60414567852536]
Long-range navigation requires both planning and reasoning about local traversability.
We propose a learning-based approach that integrates learning and planning.
ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away.
arXiv Detail & Related papers (2022-02-23T02:14:23Z)
- Self-Supervised Domain Adaptation for Visual Navigation with Global Map Consistency [6.385006149689549]
We propose a self-supervised adaptation task for a visual navigation agent to generalize to unseen environments.
The proposed task is completely self-supervised, not requiring any supervision from ground-truth pose data or explicit noise model.
Our experiments show that the proposed task helps the agent to successfully transfer to new, noisy environments.
arXiv Detail & Related papers (2021-10-14T07:14:36Z)
- MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation [23.877609358505268]
Recent work shows that map-like memory is useful for long-horizon navigation tasks.
We propose the multiON task, which requires navigation to an episode-specific sequence of objects in a realistic environment.
We examine how a variety of agent models perform across a spectrum of navigation task complexities.
arXiv Detail & Related papers (2020-12-07T18:42:38Z)
- Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
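To make the input/output contract of anticipation concrete, the toy sketch below fills unobserved cells of a top-down occupancy grid from nearby observed cells. The real model learns this mapping from egocentric RGB-D and top-down context; this rule-based stand-in is only an illustration under our own assumptions.

```python
# Toy stand-in for occupancy anticipation: fill unobserved cells of a top-down
# occupancy grid from the nearest observed cell. The actual method learns this
# from egocentric RGB-D; this only illustrates the input/output shapes.
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def anticipate(grid: np.ndarray) -> np.ndarray:
    """Replace UNKNOWN cells with the value of the nearest observed cell."""
    out = grid.copy()
    obs = np.argwhere(grid != UNKNOWN)
    for r, c in np.argwhere(grid == UNKNOWN):
        dists = np.abs(obs[:, 0] - r) + np.abs(obs[:, 1] - c)  # Manhattan distance
        nr, nc = obs[dists.argmin()]
        out[r, c] = grid[nr, nc]
    return out

partial = np.array([
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [UNKNOWN, UNKNOWN, UNKNOWN],
])
print(anticipate(partial))
```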
arXiv Detail & Related papers (2020-08-21T03:16:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.