BEV-Locator: An End-to-end Visual Semantic Localization Network Using
Multi-View Images
- URL: http://arxiv.org/abs/2211.14927v1
- Date: Sun, 27 Nov 2022 20:24:56 GMT
- Title: BEV-Locator: An End-to-end Visual Semantic Localization Network Using
Multi-View Images
- Authors: Zhihuang Zhang, Meng Xu, Wenqiang Zhou, Tao Peng, Liang Li, Stefan
Poslad
- Abstract summary: We propose an end-to-end visual semantic localization neural network using multi-view camera images.
BEV-Locator is capable of estimating vehicle poses under versatile scenarios.
Experiments report satisfactory accuracy, with mean absolute errors of 0.052 m, 0.135 m, and 0.251$^\circ$ in lateral translation, longitudinal translation, and heading angle, respectively.
- Score: 13.258689143949912
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Accurate localization ability is fundamental in autonomous driving.
Traditional visual localization frameworks approach the semantic map-matching
problem with geometric models, which rely on complex parameter tuning and thus
hinder large-scale deployment. In this paper, we propose BEV-Locator: an
end-to-end visual semantic localization neural network using multi-view camera
images. Specifically, a visual BEV (Birds-Eye-View) encoder extracts and
flattens the multi-view images into BEV space, while the semantic map features
are structurally embedded as a sequence of map queries. A cross-modal
transformer then associates the BEV features with the semantic map queries, and
the localization information of the ego-car is recursively queried out by
cross-attention modules. Finally, the ego pose can be inferred by decoding the
transformer outputs. We evaluate the proposed method in large-scale nuScenes
and Qcraft datasets. The experimental results show that BEV-Locator is capable
of estimating vehicle poses under versatile scenarios, effectively associating
the cross-modal information from multi-view images and global semantic maps.
The experiments report satisfactory accuracy, with mean absolute errors of
0.052 m, 0.135 m, and 0.251$^\circ$ in lateral translation, longitudinal
translation, and heading angle, respectively.
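To make the described pipeline concrete, the following is a minimal PyTorch sketch of the data flow: flattened BEV features, semantic map elements embedded as queries, a cross-modal transformer, and a pose head. All module names, feature dimensions, and the three-DoF pose head are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of the BEV-Locator data flow described in the abstract.
    # Names, shapes, and hyper-parameters are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class BEVLocatorSketch(nn.Module):
        def __init__(self, d_model=256, n_heads=8, n_layers=6):
            super().__init__()
            # Placeholder BEV encoder: the paper lifts multi-view camera features
            # into a flattened BEV grid; here we just project channels.
            self.bev_proj = nn.Linear(512, d_model)
            # Semantic map elements embedded as a sequence of map queries
            # (16 = assumed per-element descriptor size).
            self.map_embed = nn.Linear(16, d_model)
            # Cross-modal transformer decoder: map queries attend to BEV features.
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            # Pose head decodes (lateral, longitudinal, heading) offsets.
            self.pose_head = nn.Linear(d_model, 3)

        def forward(self, bev_feats, map_elements):
            # bev_feats:    (B, H*W, 512) flattened BEV features from multi-view images
            # map_elements: (B, N, 16)    structured semantic map descriptors
            memory = self.bev_proj(bev_feats)
            queries = self.map_embed(map_elements)
            out = self.decoder(queries, memory)        # cross-attention between modalities
            return self.pose_head(out.mean(dim=1))     # (B, 3) ego-pose offsets

    pose = BEVLocatorSketch()(torch.randn(2, 40 * 40, 512), torch.randn(2, 100, 16))
    print(pose.shape)  # torch.Size([2, 3])

In this sketch the map queries attend to the BEV features through standard transformer-decoder cross-attention, and the pooled query outputs are decoded into lateral, longitudinal, and heading offsets.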
Related papers
- OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Geometric and Semantic Guidances [11.085165252259042]
OSMLoc is a brain-inspired single-image visual localization method with semantic and geometric guidance to improve accuracy, robustness, and generalization ability.
To validate the proposed OSMLoc, we collect a worldwide cross-area and cross-condition (CC) benchmark for extensive evaluation.
arXiv Detail & Related papers (2024-11-13T14:59:00Z) - VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car.
We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space.
Thanks to the obtained BEV tokens, accompanied by a codebook embedding that encapsulates the semantics of different BEV elements in the ground-truth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens.
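As a rough illustration of the vector-quantization step that such a VQ-VAE-style tokenization relies on (the codebook size, feature dimension, and L2 distance below are assumptions, not the paper's settings):

    # Hypothetical nearest-codebook lookup, the core of VQ-VAE-style tokenization.
    import torch

    def quantize(features, codebook):
        # features: (N, D) continuous BEV features; codebook: (K, D) learned embeddings.
        dists = torch.cdist(features, codebook)    # (N, K) pairwise L2 distances
        tokens = dists.argmin(dim=1)               # discrete BEV token ids
        quantized = codebook[tokens]               # replace features by codebook entries
        return tokens, quantized

    codebook = torch.randn(512, 256)               # assumed 512 tokens of dimension 256
    tokens, q = quantize(torch.randn(1000, 256), codebook)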
arXiv Detail & Related papers (2024-11-03T16:09:47Z) - Progressive Query Refinement Framework for Bird's-Eye-View Semantic Segmentation from Surrounding Images [3.495246564946556]
We introduce the Multi-Resolution (MR) concept into Bird's-Eye-View (BEV) semantic segmentation for autonomous driving.
We propose a visual feature interaction network that promotes interactions between features across images and across feature levels.
We evaluate our model on a large-scale real-world dataset.
arXiv Detail & Related papers (2024-07-24T05:00:31Z)
- Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP identifies co-visible image sections by obtaining patch-level embeddings using a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
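A toy sketch of the kind of patch-level voting such an overlap-based retrieval could use (the cosine-similarity threshold and scoring below are assumptions made up for illustration):

    # Toy patch-voting scheme: each query patch votes for database images whose
    # patch embeddings match it well. Thresholds and scoring are assumptions.
    import torch
    import torch.nn.functional as F

    def overlap_scores(query_patches, db_patches, sim_thresh=0.8):
        # query_patches: (P, D); db_patches: (M, P, D) for M database images.
        q = F.normalize(query_patches, dim=-1)
        db = F.normalize(db_patches, dim=-1)
        sims = torch.einsum('pd,mqd->mpq', q, db)  # cosine similarity per image/patch pair
        best = sims.max(dim=-1).values             # best database match for each query patch
        votes = (best > sim_thresh).sum(dim=-1)    # count matching patches per image
        return votes                               # higher vote count = larger predicted overlap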
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- EgoVM: Achieving Precise Ego-Localization using Lightweight Vectorized Maps [9.450650025266379]
We present EgoVM, an end-to-end localization network that achieves comparable localization accuracy to prior state-of-the-art methods.
We employ a set of learnable semantic embeddings to encode the semantic types of map elements and supervise them with semantic segmentation.
We adopt a robust histogram-based pose solver to estimate the optimal pose by searching exhaustively over candidate poses.
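The exhaustive search mentioned above can be pictured as scoring a grid of candidate pose offsets and keeping the best one; the grid ranges, resolution, and scoring function below are assumptions for illustration, not EgoVM's actual solver.

    # Illustrative exhaustive pose search over a small grid of candidate offsets.
    import numpy as np

    def solve_pose(score_fn, dxy_range=0.5, dyaw_range=2.0, step=0.05, yaw_step=0.2):
        best, best_pose = -np.inf, (0.0, 0.0, 0.0)
        for dx in np.arange(-dxy_range, dxy_range + 1e-9, step):
            for dy in np.arange(-dxy_range, dxy_range + 1e-9, step):
                for dyaw in np.arange(-dyaw_range, dyaw_range + 1e-9, yaw_step):
                    s = score_fn(dx, dy, dyaw)  # e.g. map-matching score at this candidate
                    if s > best:
                        best, best_pose = s, (dx, dy, dyaw)
        return best_pose

    # Example with a dummy score that peaks at (0.1, -0.2, 0.4):
    pose = solve_pose(lambda dx, dy, dyaw: -((dx - 0.1)**2 + (dy + 0.2)**2 + (dyaw - 0.4)**2))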
arXiv Detail & Related papers (2023-07-18T06:07:25Z)
- LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation [43.12994451281451]
We present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras.
Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations.
arXiv Detail & Related papers (2022-06-27T13:37:50Z)
- ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation [2.70519393940262]
We evaluate the use of vision transformers (ViT) as a backbone architecture to generate Bird's-Eye-View (BEV) maps.
Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image.
We evaluate our approach on the nuScenes dataset demonstrating a considerable improvement relative to state-of-the-art approaches.
arXiv Detail & Related papers (2022-05-31T10:18:36Z)
- GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
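A minimal sketch of that parallel convolution-plus-attention layout (the channel sizes, token grid, and additive fusion are assumptions for illustration, not the paper's exact design):

    # Minimal parallel conv + self-attention block in the spirit of ViTAE's design.
    import torch
    import torch.nn as nn

    class ParallelConvAttnBlock(nn.Module):
        def __init__(self, dim=256, heads=8, grid=14):
            super().__init__()
            self.grid = grid
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)  # parallel conv branch
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.norm = nn.LayerNorm(dim)

        def forward(self, tokens):                      # tokens: (B, grid*grid, dim)
            b, n, d = tokens.shape
            attn_out, _ = self.attn(tokens, tokens, tokens)
            # Reshape tokens to a 2D map for the convolution branch.
            fmap = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
            conv_out = self.conv(fmap).flatten(2).transpose(1, 2)
            fused = attn_out + conv_out                 # fuse both branches
            return tokens + self.ffn(self.norm(fused))  # feed fused features into the FFN

    out = ParallelConvAttnBlock()(torch.randn(2, 14 * 14, 256))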
arXiv Detail & Related papers (2021-06-07T05:31:06Z)