C-BEV: Contrastive Bird's Eye View Training for Cross-View Image
Retrieval and 3-DoF Pose Estimation
- URL: http://arxiv.org/abs/2312.08060v1
- Date: Wed, 13 Dec 2023 11:14:57 GMT
- Authors: Florian Fervers, Sebastian Bullinger, Christoph Bodensteiner, Michael
Arens, Rainer Stiefelhagen
- Abstract summary: We propose a novel trainable retrieval architecture that uses bird's eye view (BEV) maps rather than vectors as embedding representation.
Our method C-BEV surpasses the state-of-the-art on the retrieval task on multiple datasets by a large margin.
- Score: 27.870926763424848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To find the geolocation of a street-view image, cross-view geolocalization
(CVGL) methods typically perform image retrieval on a database of georeferenced
aerial images and determine the location from the visually most similar match.
Recent approaches focus mainly on settings where street-view and aerial images
are preselected to align w.r.t. translation or orientation, but struggle in
challenging real-world scenarios where varying camera poses have to be matched
to the same aerial image. We propose a novel trainable retrieval architecture
that uses bird's eye view (BEV) maps rather than vectors as embedding
representation, and explicitly addresses the many-to-one ambiguity that arises
in real-world scenarios. The BEV-based retrieval is trained using the same
contrastive setting and loss as classical retrieval.
Our method C-BEV surpasses the state-of-the-art on the retrieval task on
multiple datasets by a large margin. It is particularly effective in
challenging many-to-one scenarios, e.g. increasing the top-1 recall on VIGOR's
cross-area split with unknown orientation from 31.1% to 65.0%. Although the
model is supervised only through a contrastive objective applied on image
pairings, it additionally learns to infer the 3-DoF camera pose on the matching
aerial image, and even yields a lower mean pose error than recent methods that
are explicitly trained with metric ground truth.
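The abstract states that the BEV-map embeddings are trained with the same contrastive setting and loss as classical retrieval. A minimal sketch of such a symmetric InfoNCE objective over paired street-view/aerial BEV embeddings (the shapes, the flatten-then-normalize comparison, and the function names are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def bev_info_nce(street_bev, aerial_bev, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired BEV embeddings.

    street_bev, aerial_bev: (B, H, W, C) arrays; pair i is the positive match.
    """
    b = street_bev.shape[0]
    # Flatten each BEV map to a vector and L2-normalize it.
    s = street_bev.reshape(b, -1)
    a = aerial_bev.reshape(b, -1)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = s @ a.T / temperature  # (B, B) scaled cosine similarities
    # Cross-entropy with the diagonal (true pairs) as targets, in both
    # retrieval directions (street->aerial and aerial->street).
    log_p_rows = logits - _logsumexp(logits, axis=1)
    log_p_cols = logits - _logsumexp(logits, axis=0)
    return -0.5 * np.mean(np.diag(log_p_rows) + np.diag(log_p_cols))
```

The loss is lowest when each street-view BEV map is most similar to its own aerial counterpart and dissimilar to all other aerial maps in the batch.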
Related papers
- Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network [12.692812966686066]
Cross-view geolocalization identifies the geographic location of street view images by matching them with a georeferenced satellite database.
We propose a new approach for cross-view image geo-localization, i.e., the Panorama-BEV Co-Retrieval Network.
arXiv Detail & Related papers (2024-08-10T08:03:58Z)
- Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP processes co-visible image sections by obtaining patch-level embeddings using a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth [56.565405280314884]
This paper focuses on improving the performance of a trained model in a new target area by leveraging only the target-area images without fine GT.
We propose a weakly supervised learning approach based on knowledge self-distillation.
Our approach is validated using two recent state-of-the-art models on two benchmarks.
arXiv Detail & Related papers (2024-06-01T15:58:35Z)
- BEV-CV: Birds-Eye-View Transform for Cross-View Geo-Localisation [15.324623975476348]
Cross-view image matching for geo-localisation is a challenging problem due to the significant visual difference between aerial and ground-level viewpoints.
We propose BEV-CV, an approach introducing two key novelties with a focus on improving the real-world viability of cross-view geo-localisation.
arXiv Detail & Related papers (2023-12-23T22:20:45Z)
- Visual Localization via Few-Shot Scene Region Classification [84.34083435501094]
Visual (re)localization addresses the problem of estimating the 6-DoF camera pose of a query image captured in a known scene.
Recent advances in structure-based localization solve this problem by memorizing the mapping from image pixels to scene coordinates.
We propose a scene region classification approach to achieve fast and effective scene memorization with few-shot images.
arXiv Detail & Related papers (2022-08-14T22:39:02Z)
- Satellite Image Based Cross-view Localization for Autonomous Vehicle [59.72040418584396]
This paper shows that by using an off-the-shelf high-definition satellite image as a ready-to-use map, we are able to achieve cross-view vehicle localization up to a satisfactory accuracy.
Our method is validated on KITTI and Ford Multi-AV Seasonal datasets as ground view and Google Maps as the satellite view.
arXiv Detail & Related papers (2022-07-27T13:16:39Z)
- "The Pedestrian next to the Lamppost" Adaptive Object Graphs for Better Instantaneous Mapping [45.94778766867247]
Estimating a semantically segmented bird's-eye-view map from a single image has become a popular technique for autonomous control and navigation.
We show an increase in localization error with distance from the camera.
We propose a graph neural network which predicts BEV objects from a monocular image by spatially reasoning about an object within the context of other objects.
arXiv Detail & Related papers (2022-04-06T17:23:13Z)
- Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images [69.5662419067878]
Grounding referring expressions in RGBD images is an emerging field.
We present a novel task of 3D visual grounding in single-view RGBD image where the referred objects are often only partially scanned due to occlusion.
Our approach first fuses the language and the visual features at the bottom level to generate a heatmap that localizes the relevant regions in the RGBD image.
Then our approach conducts an adaptive feature learning based on the heatmap and performs the object-level matching with another visio-linguistic fusion to finally ground the referred object.
arXiv Detail & Related papers (2021-03-14T11:18:50Z)
- VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval [19.239311087570318]
Cross-view image geo-localization aims to determine the locations of street-view query images by matching with GPS-tagged reference images from aerial view.
Recent works have achieved surprisingly high retrieval accuracy on city-scale datasets.
We propose a new large-scale benchmark -- VIGOR -- for cross-View Image Geo-localization beyond One-to-one Retrieval.
arXiv Detail & Related papers (2020-11-24T15:50:54Z)
- Where am I looking at? Joint Location and Orientation Estimation by Cross-View Matching [95.64702426906466]
Cross-view geo-localization determines the location of a ground-level image by matching it against a large-scale database of geo-tagged aerial images.
Knowing orientation between ground and aerial images can significantly reduce matching ambiguity between these two views.
We design a Dynamic Similarity Matching network to estimate cross-view orientation alignment during localization.
arXiv Detail & Related papers (2020-05-08T05:21:16Z)
- Cross-View Image Retrieval -- Ground to Aerial Image Retrieval through Deep Learning [3.326320568999945]
We present a novel cross-modal retrieval method specifically for multi-view images, called Cross-view Image Retrieval (CVIR).
Our approach aims to find a feature space as well as an embedding space in which samples from street-view images are compared directly to satellite-view images.
For this comparison, we propose a novel deep metric learning based solution, DeepCVIR.
arXiv Detail & Related papers (2020-05-02T06:52:16Z)
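Several of the retrieval methods above, e.g. the DeepCVIR solution, rely on deep metric learning to align street-view and aerial embedding spaces. A minimal sketch of one standard such objective, a triplet margin loss; the exact loss formulations used by these papers may differ:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Pull the matching (positive) embedding closer to the anchor than a
    non-matching (negative) one by at least `margin`, using squared
    Euclidean distances, averaged over the batch.

    anchor, positive, negative: (B, D) embedding arrays.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor-negative distance
    # Hinge: zero loss once the negative is margin farther than the positive.
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))
```

In a cross-view setting, the anchor would be a street-view embedding, the positive its georeferenced aerial match, and the negative an aerial image from a different location.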
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.