AutoTour: Automatic Photo Tour Guide with Smartphones and LLMs
- URL: http://arxiv.org/abs/2601.06781v1
- Date: Sun, 11 Jan 2026 05:13:39 GMT
- Title: AutoTour: Automatic Photo Tour Guide with Smartphones and LLMs
- Authors: Huatao Xu, Zihe Liu, Zilin Zeng, Baichuan Li, Mo Li
- Abstract summary: We present AutoTour, a system that enhances user exploration by automatically generating fine-grained landmark annotations and descriptive narratives for photos captured by users. The key idea of AutoTour is to fuse visual features extracted from photos with nearby geospatial features queried from open mapping databases. We demonstrate that AutoTour can deliver rich, interpretable annotations for both iconic and lesser-known landmarks, enabling a new form of interactive, context-aware exploration.
- Score: 4.443162611503121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present AutoTour, a system that enhances user exploration by automatically generating fine-grained landmark annotations and descriptive narratives for photos captured by users. The key idea of AutoTour is to fuse visual features extracted from photos with nearby geospatial features queried from open mapping databases. Unlike existing tour applications that rely on pre-defined content or proprietary datasets, AutoTour leverages open and extensible data sources to provide scalable and context-aware photo-based guidance. To achieve this, we design a training-free pipeline that first extracts and filters relevant geospatial features around the user's GPS location. It then detects major landmarks in user photos through VLM-based feature detection and projects them into the horizontal spatial plane. A geometric matching algorithm aligns photo features with corresponding geospatial entities based on their estimated distance and direction. The matched features are subsequently grounded and annotated directly on the original photo, accompanied by large language model-generated textual and audio descriptions to provide an informative, tour-like experience. We demonstrate that AutoTour can deliver rich, interpretable annotations for both iconic and lesser-known landmarks, enabling a new form of interactive, context-aware exploration that bridges visual perception and geospatial understanding.
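The geometric matching step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the thresholds (`max_angle`, `max_rel_dist`), the equirectangular distance approximation, and the greedy lowest-cost assignment are all assumptions; AutoTour's actual algorithm may differ.

```python
import math

def bearing_and_distance(user, feature):
    """Approximate bearing (degrees clockwise from north) and distance (m)
    from the user's GPS fix (lat, lon) to a map feature, using a simple
    equirectangular approximation (adequate at landmark ranges)."""
    lat1, lon1 = map(math.radians, user)
    lat2, lon2 = map(math.radians, feature)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)  # east offset (rad)
    y = lat2 - lat1                                   # north offset (rad)
    distance = 6_371_000 * math.hypot(x, y)           # mean Earth radius
    bearing = math.degrees(math.atan2(x, y)) % 360
    return bearing, distance

def match_landmarks(photo_feats, geo_feats, user,
                    max_angle=15.0, max_rel_dist=0.5):
    """Greedily align photo landmarks, given as {name: (bearing_deg,
    est_distance_m)}, with map features {id: (lat, lon)} by combining
    angular error and relative distance error into one cost."""
    matches, used = [], set()
    for name, (pb, pd) in photo_feats.items():
        best, best_cost = None, float("inf")
        for gid, coords in geo_feats.items():
            if gid in used:
                continue
            gb, gd = bearing_and_distance(user, coords)
            dang = min(abs(pb - gb), 360 - abs(pb - gb))  # wrap-around
            drel = abs(pd - gd) / max(gd, 1.0)
            if dang > max_angle or drel > max_rel_dist:
                continue  # outside the matching gate
            cost = dang / max_angle + drel / max_rel_dist
            if cost < best_cost:
                best, best_cost = gid, cost
        if best is not None:
            used.add(best)
            matches.append((name, best))
    return matches
```

In practice the photo-side bearings would come from the camera heading plus each landmark's horizontal position in the frame, and the map-side features from an open geodata query around the GPS fix.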
Related papers
- Spatial Retrieval Augmented Autonomous Driving [81.39665750557526]
Existing autonomous driving systems rely on onboard sensors for environmental perception. We propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.
arXiv Detail & Related papers (2025-12-07T14:40:49Z) - DescribeEarth: Describe Anything for Remote Sensing Images [56.04533626223295]
We propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset with detailed descriptions of object attributes, relationships, and contexts. We also present DescribeEarth, a Multi-modal Large Language Model architecture explicitly designed for Geo-DLC.
arXiv Detail & Related papers (2025-09-30T01:53:34Z) - IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence [36.703562827382655]
We introduce IMAIA, an interactive Maps AI Assistant. It enables natural-language interaction with both vector (street) maps and satellite imagery. It augments camera inputs with geospatial intelligence to help users understand the world.
arXiv Detail & Related papers (2025-07-09T16:18:09Z) - GPS as a Control Signal for Image Generation [95.43433150105385]
We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city.
arXiv Detail & Related papers (2025-01-21T18:59:46Z) - Pole-based Vehicle Localization with Vector Maps: A Camera-LiDAR Comparative Study [6.300346102366891]
In road environments, much common street furniture, such as traffic signs, traffic lights, and street lights, takes the form of poles. This paper introduces a real-time method for camera-based pole detection using a lightweight neural network trained on automatically annotated images. The results highlight the high accuracy of the vision-based approach in open road conditions.
arXiv Detail & Related papers (2024-12-11T09:05:05Z) - Continuous Self-Localization on Aerial Images Using Visual and Lidar Sensors [25.87104194833264]
We propose a novel method for geo-tracking in outdoor environments by registering a vehicle's sensor information with aerial imagery of an unseen target region.
We train a model in a metric learning setting to extract visual features from ground and aerial images.
Our method is the first to utilize on-board cameras in an end-to-end differentiable model for metric self-localization on unseen orthophotos.
arXiv Detail & Related papers (2022-03-07T12:25:44Z) - Semantic Image Alignment for Vehicle Localization [111.59616433224662]
We present a novel approach to vehicle localization in dense semantic maps using semantic segmentation from a monocular camera.
In contrast to existing visual localization approaches, the system does not require additional keypoint features, handcrafted localization landmark extractors or expensive LiDAR sensors.
arXiv Detail & Related papers (2021-10-08T14:40:15Z) - Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates, from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z) - Self-supervised Segmentation via Background Inpainting [96.10971980098196]
We introduce a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera.
We exploit a self-supervised loss function to train a proposal-based segmentation network.
We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks and outperform existing self-supervised methods.
arXiv Detail & Related papers (2020-11-11T08:34:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.