Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
- URL: http://arxiv.org/abs/2301.04224v2
- Date: Sun, 9 Apr 2023 21:30:05 GMT
- Title: Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
- Authors: Xindi Wu, KwunFung Lau, Francesco Ferroni, Aljoša Ošep, Deva Ramanan
- Abstract summary: We introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images.
This problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps.
We show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.
- Score: 42.05213970259352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-driving vehicles rely on urban street maps for autonomous navigation. In
this paper, we introduce Pix2Map, a method for inferring urban street map
topology directly from ego-view images, as needed to continually update and
expand existing maps. This is a challenging task, as we need to infer a complex
urban road topology directly from raw image data. The main insight of this
paper is that this problem can be posed as cross-modal retrieval by learning a
joint, cross-modal embedding space for images and existing maps, represented as
discrete graphs that encode the topological layout of the visual surroundings.
We conduct our experimental evaluation using the Argoverse dataset and show
that it is indeed possible to accurately retrieve street maps corresponding to
both seen and unseen roads solely from image data. Moreover, we show that our
retrieved maps can be used to update or expand existing maps and even show
proof-of-concept results for visual localization and image retrieval from
spatial graphs.
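The abstract's key move is to turn map inference into retrieval over a joint image-graph embedding space. Below is a minimal sketch of how such a space is typically trained, using a CLIP-style symmetric contrastive loss; the loss form and all names are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(image_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/graph embeddings.

    image_emb, graph_emb: (B, D) tensors; row i of each comes from the same
    location, so the diagonal of the similarity matrix holds the positive
    pairs and every off-diagonal entry serves as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    logits = image_emb @ graph_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Retrieval must work in both directions: image -> map and map -> image.
    loss_i2g = F.cross_entropy(logits, targets)
    loss_g2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2g + loss_g2i)

def retrieve_map(query_emb, map_bank_embs):
    """At test time retrieval reduces to nearest-neighbour search: embed the
    query images, then return the stored map graph with the closest embedding."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(map_bank_embs, dim=-1).t()
    return sims.argmax(dim=-1)   # index of the best-matching stored graph
```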
Related papers
- Neural Semantic Map-Learning for Autonomous Vehicles [85.8425492858912]
We present a mapping system that fuses local submaps gathered from a fleet of vehicles at a central instance to produce a coherent map of the road environment.
Our method jointly aligns and merges the noisy and incomplete local submaps using a scene-specific Neural Signed Distance Field.
We leverage memory-efficient sparse feature-grids to scale to large areas and introduce a confidence score to model uncertainty in scene reconstruction.
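To make the Neural Signed Distance Field concrete: at its core it is a coordinate network mapping 3D points to signed distances, fitted to observations pooled from the submaps. A minimal sketch follows, assuming a plain MLP instead of the paper's memory-efficient sparse feature-grids, with the confidence score reduced to a simple per-point weight.

```python
import torch.nn as nn

class SceneSDF(nn.Module):
    """Minimal scene-specific neural SDF: an MLP mapping a 3D point to its
    signed distance from the reconstructed road-environment surface."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):                  # xyz: (N, 3)
        return self.net(xyz).squeeze(-1)     # (N,) signed distances

def fit_step(sdf, surface_pts, confidence, free_pts, free_dists, optimizer):
    """One optimisation step against points pooled from noisy submaps.

    Surface samples should evaluate to zero distance, weighted by a per-point
    confidence; free-space samples with approximate known distances prevent
    the trivial all-zero solution.
    """
    optimizer.zero_grad()
    loss = (confidence * sdf(surface_pts).abs()).mean() \
         + (sdf(free_pts) - free_dists).pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```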
arXiv Detail & Related papers (2024-10-10T10:10:03Z)
- Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network [12.692812966686066]
Cross-view geolocalization identifies the geographic location of street view images by matching them with a georeferenced satellite database.
We propose a new approach for cross-view image geo-localization, i.e., the Panorama-BEV Co-Retrieval Network.
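At query time, cross-view geo-localization of this kind reduces to nearest-neighbour search over a georeferenced satellite database in a shared descriptor space. A minimal sketch, with the encoders omitted and a hypothetical array layout:

```python
import numpy as np

def geolocalize(query_emb, db_embs, db_coords, top_k=5):
    """Match one street-view/BEV query descriptor against a georeferenced
    satellite database and return candidate locations.

    query_emb: (D,) unit-normalised descriptor of the query panorama/BEV.
    db_embs:   (N, D) unit-normalised descriptors of satellite tiles.
    db_coords: (N, 2) latitude/longitude of each tile centre.
    """
    sims = db_embs @ query_emb              # cosine similarity, since unit norm
    best = np.argsort(-sims)[:top_k]        # indices of the top-k tiles
    return db_coords[best], sims[best]
```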
arXiv Detail & Related papers (2024-08-10T08:03:58Z)
- CartoMark: a benchmark dataset for map pattern recognition and map content retrieval with machine intelligence [9.652629004863364]
We develop a large-scale benchmark dataset for map text annotation recognition, map scene classification, map super-resolution reconstruction, and map style transferring.
These well-labelled datasets would facilitate state-of-the-art machine intelligence technologies in map feature detection, map pattern recognition, and map content retrieval.
arXiv Detail & Related papers (2023-12-14T01:54:38Z)
- SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding [57.108301842535894]
We introduce SNAP, a deep network that learns rich neural 2D maps from ground-level and overhead images.
We train our model to align neural maps estimated from different inputs, supervised only with camera poses over tens of millions of StreetView images.
SNAP can resolve the location of challenging image queries beyond the reach of traditional methods.
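A rough reading of the pose-supervised alignment objective, assuming both neural maps have already been resampled into a shared bird's-eye grid using the known camera poses; the resampling, encoders, and loss form are assumptions, not SNAP's published details:

```python
import torch.nn.functional as F

def alignment_loss(ground_map, aerial_map, valid_mask):
    """Pull together two neural 2D maps of the same place.

    ground_map, aerial_map: (B, C, H, W) feature maps, already expressed in a
    shared bird's-eye grid via the known camera poses (resampling omitted).
    valid_mask: (B, 1, H, W) marks cells observed by both inputs.
    """
    g = F.normalize(ground_map, dim=1)
    a = F.normalize(aerial_map, dim=1)
    cos = (g * a).sum(dim=1, keepdim=True)      # per-cell cosine similarity
    return ((1.0 - cos) * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```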
arXiv Detail & Related papers (2023-06-08T17:54:47Z)
- Dataset of Pathloss and ToA Radio Maps With Localization Application [59.11388233415274]
The datasets include simulated pathloss/received signal strength (RSS) and time of arrival (ToA) radio maps over a large collection of realistic dense urban settings in real city maps.
The two main applications of the presented dataset are 1) learning methods that predict the pathloss from input city maps, and 2) wireless localization.
The fact that the RSS and ToA maps are computed by the same simulations over the same city maps allows for a fair comparison of the RSS and ToA-based localization methods.
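For the localization application, a radio map turns positioning into fingerprint matching: the grid cell whose map values best explain the measured RSS or ToA wins. A minimal sketch (shapes and names are illustrative); note that the same estimator applies unchanged to RSS and ToA maps, which is what makes the shared simulation a fair basis for comparison:

```python
import numpy as np

def localize(measured, radio_maps, grid_xy):
    """Fingerprint localization against a simulated radio map.

    measured:   (K,) RSS or ToA values observed from K base stations.
    radio_maps: (K, N) map value of each base station at each of N grid cells.
    grid_xy:    (N, 2) metric coordinates of the grid cells.
    Returns the grid position minimising the squared fingerprint residual.
    """
    residual = ((radio_maps - measured[:, None]) ** 2).sum(axis=0)   # (N,)
    return grid_xy[np.argmin(residual)]
```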
arXiv Detail & Related papers (2022-11-18T20:39:51Z)
- A Survey on Visual Map Localization Using LiDARs and Cameras [0.0]
We define visual map localization as a two-stage process.
At the stage of place recognition, the initial position of the vehicle in the map is determined by comparing the visual sensor output with a set of geo-tagged map regions of interest.
At the stage of map metric localization, the vehicle is tracked while it moves across the map by continuously aligning the visual sensors' output with the current area of the map that is being traversed.
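In code form, the survey's two-stage decomposition looks roughly like the skeleton below; every callable is a placeholder for whichever place-recognition and alignment methods a concrete system plugs in:

```python
def visual_map_localization(frames, map_regions, recognize, align):
    """Skeleton of the survey's two-stage process.

    recognize(frame, map_regions) -> initial pose, via place recognition
        against a set of geo-tagged map regions of interest.
    align(frame, local_map, pose) -> refined pose, via metric alignment of
        the visual sensor output with the currently traversed map area.
    """
    pose = recognize(frames[0], map_regions)        # stage 1: place recognition
    trajectory = [pose]
    for frame in frames[1:]:                        # stage 2: metric tracking
        local_map = map_regions.around(pose)        # area currently traversed
        pose = align(frame, local_map, pose)
        trajectory.append(pose)
    return trajectory
```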
arXiv Detail & Related papers (2022-08-05T20:11:18Z)
- csBoundary: City-scale Road-boundary Detection in Aerial Images for High-definition Maps [10.082536828708779]
We propose csBoundary to automatically detect road boundaries at the city scale for HD map annotation.
Our network takes as input an aerial image patch, and directly infers the continuous road-boundary graph from this image.
Our csBoundary is evaluated and compared on a public benchmark dataset.
arXiv Detail & Related papers (2021-11-11T02:04:36Z)
- Semantic Image Alignment for Vehicle Localization [111.59616433224662]
We present a novel approach to vehicle localization in dense semantic maps using semantic segmentation from a monocular camera.
In contrast to existing visual localization approaches, the system does not require additional keypoint features, handcrafted localization landmark extractors or expensive LiDAR sensors.
arXiv Detail & Related papers (2021-10-08T14:40:15Z)
- Learning Lane Graph Representations for Motion Forecasting [92.88572392790623]
We construct a lane graph from raw map data to preserve the map structure.
We exploit a fusion network consisting of four types of interactions, actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor.
Our approach significantly outperforms the state-of-the-art on the large scale Argoverse motion forecasting benchmark.
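The four interaction types read naturally as four message-passing blocks applied in sequence. A compressed sketch using generic attention layers; the paper uses its own graph convolution operators, so treat this purely as an illustration of the data flow:

```python
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of the four-way interaction stack: actors (moving agents) and
    lane nodes exchange features in both directions before prediction."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        attn = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2l, self.l2l, self.l2a, self.a2a = attn(), attn(), attn(), attn()

    def forward(self, actors, lanes):       # (B, Na, dim), (B, Nl, dim)
        # actor-to-lane: lane nodes gather real-time traffic information
        lanes = lanes + self.a2l(lanes, actors, actors)[0]
        # lane-to-lane: propagate information over the lane-graph topology
        lanes = lanes + self.l2l(lanes, lanes, lanes)[0]
        # lane-to-actor: actors fetch map context back from the lanes
        actors = actors + self.l2a(actors, lanes, lanes)[0]
        # actor-to-actor: social interaction between agents
        actors = actors + self.a2a(actors, actors, actors)[0]
        return actors
```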
arXiv Detail & Related papers (2020-07-27T17:59:49Z)
- Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks [27.86228863466213]
We present a simple, unified approach for estimating maps directly from monocular images using a single end-to-end deep learning architecture.
We demonstrate the effectiveness of our approach by evaluating against several challenging baselines on the NuScenes and Argoverse datasets.
arXiv Detail & Related papers (2020-03-30T12:39:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.