CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data
- URL: http://arxiv.org/abs/2302.01665v2
- Date: Fri, 6 Oct 2023 06:26:34 GMT
- Title: CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data
- Authors: Junyi Ma, Guangming Xiong, Jingyi Xu, Xieyuanli Chen
- Abstract summary: We propose a cross-view transformer-based network, dubbed CVTNet, to fuse the range image views (RIVs) and bird's eye views (BEVs) generated from the LiDAR data.
We evaluate our approach on three datasets collected with different sensor setups and environmental conditions.
- Score: 15.144590078316252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR-based place recognition (LPR) is one of the most crucial components of
autonomous vehicles to identify previously visited places in GPS-denied
environments. Most existing LPR methods use mundane representations of the
input point cloud without considering different views, which may not fully
exploit the information from LiDAR sensors. In this paper, we propose a
cross-view transformer-based network, dubbed CVTNet, to fuse the range image
views (RIVs) and bird's eye views (BEVs) generated from the LiDAR data. It
extracts correlations within the views themselves using intra-transformers and
between the two different views using inter-transformers. Based on that, our
proposed CVTNet generates a yaw-angle-invariant global descriptor for each
laser scan end-to-end online and retrieves previously seen places by descriptor
matching between the current query scan and the pre-built database. We evaluate
our approach on three datasets collected with different sensor setups and
environmental conditions. The experimental results show that our method
outperforms the state-of-the-art LPR methods with strong robustness to
viewpoint changes and long-time spans. Furthermore, our approach has a good
real-time performance that can run faster than the typical LiDAR frame rate.
The implementation of our method is released as open source at:
https://github.com/BIT-MJY/CVTNet.
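To make the pipeline in the abstract more concrete, below is a minimal sketch of how a single LiDAR scan might be turned into the two views the network consumes: a range image view (RIV) from a spherical projection and a polar BEV that shares the same azimuth axis, so a yaw rotation of the scan shifts the columns of both views. All image sizes, fields of view, and the polar BEV choice are illustrative assumptions, not CVTNet's exact preprocessing (see the linked repository for that).

```python
import numpy as np

def range_image_view(points, h=32, w=900, fov_up=3.0, fov_down=-25.0, max_range=50.0):
    """Spherical projection of an (N, 3) point cloud into a range image.
    Field of view and image size are illustrative assumptions."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                                   # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-6), -1.0, 1.0))

    fov_up_rad, fov_down_rad = np.radians(fov_up), np.radians(fov_down)
    u = ((yaw + np.pi) / (2.0 * np.pi) * w).astype(np.int64) % w          # column from azimuth
    v = ((fov_up_rad - pitch) / (fov_up_rad - fov_down_rad) * h).astype(np.int64)
    v = np.clip(v, 0, h - 1)                                 # row from elevation

    riv = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-depth)                               # write far points first so the closest return wins
    riv[v[order], u[order]] = np.clip(depth[order] / max_range, 0.0, 1.0)
    return riv

def polar_bev_view(points, rings=32, w=900, max_range=50.0):
    """Polar BEV whose columns share the azimuth axis with the range image,
    which is the intuition behind a yaw-angle-invariant descriptor."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2)
    u = ((np.arctan2(y, x) + np.pi) / (2.0 * np.pi) * w).astype(np.int64) % w
    v = np.clip((r / max_range * rings).astype(np.int64), 0, rings - 1)

    bev = np.full((rings, w), -np.inf, dtype=np.float32)
    np.maximum.at(bev, (v, u), z.astype(np.float32))         # keep max height per cell
    bev[np.isinf(bev)] = 0.0                                 # empty cells -> 0
    return bev
```

Retrieval against the pre-built database is then a nearest-neighbour search of the query descriptor over the stored descriptors. A brute-force cosine-similarity version, standing in for the indexed search (e.g. FAISS) a real LPR system would use, could look like:

```python
import numpy as np

class DescriptorDatabase:
    """Toy brute-force retrieval over L2-normalized global descriptors."""

    def __init__(self):
        self._descs, self._ids = [], []

    def add(self, scan_id, desc):
        self._descs.append(desc / (np.linalg.norm(desc) + 1e-12))  # normalize once at insert time
        self._ids.append(scan_id)

    def query(self, desc, top_k=5):
        q = desc / (np.linalg.norm(desc) + 1e-12)
        sims = np.stack(self._descs) @ q                           # cosine similarities
        best = np.argsort(-sims)[:top_k]
        return [(self._ids[i], float(sims[i])) for i in best]
```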
Related papers
- Range and Bird's Eye View Fused Cross-Modal Visual Place Recognition [10.086473917830112]
Image-to-point cloud cross-modal Visual Place Recognition (VPR) is a challenging task where the query is an RGB image, and the database samples are LiDAR point clouds.
We propose an innovative initial retrieval + re-rank method that effectively combines information from range (or RGB) images and Bird's Eye View (BEV) images.
arXiv Detail & Related papers (2025-02-17T12:29:26Z)
- RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion [58.77329237533034]
We propose a Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection.
RaCFormer achieves 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors.
arXiv Detail & Related papers (2024-12-17T09:47:48Z)
- SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection [5.36022165180739]
We present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features.
Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors.
arXiv Detail & Related papers (2024-11-29T17:17:38Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector via simple alignment.
Experimental results demonstrate that the proposed approach outperforms existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- UnLoc: A Universal Localization Method for Autonomous Vehicles using LiDAR, Radar and/or Camera Input [51.150605800173366]
UnLoc is a novel unified neural modeling approach for localization with multi-sensor input in all weather conditions.
Our method is extensively evaluated on Oxford Radar RobotCar, ApolloSouthBay and Perth-WA datasets.
arXiv Detail & Related papers (2023-07-03T04:10:55Z)
- PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer [75.2251801053839]
We present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD).
We propose a Point-Voxel Transformer (PVT) module that obtains long-range context from voxels at low computational cost.
The experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2023-05-11T07:37:15Z)
- SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data [9.32516766412743]
We propose a transformer-based network named SeqOT to exploit the temporal and spatial information provided by sequential range images.
We evaluate our approach on four datasets collected with different types of LiDAR sensors in different environments.
Our method operates online faster than the frame rate of the sensor.
arXiv Detail & Related papers (2022-09-16T14:08:11Z)
- Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images [96.66271207089096]
FCOS-LiDAR is a fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes.
We show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors.
arXiv Detail & Related papers (2022-05-27T05:42:16Z)
- Cycle and Semantic Consistent Adversarial Domain Adaptation for Reducing Simulation-to-Real Domain Shift in LiDAR Bird's Eye View [110.83289076967895]
We present a BEV domain adaptation method based on CycleGAN that uses prior semantic classification to preserve information about small objects of interest during domain adaptation.
The quality of the generated BEVs has been evaluated using a state-of-the-art 3D object detection framework on the KITTI 3D Object Detection Benchmark.
arXiv Detail & Related papers (2021-04-22T12:47:37Z)
- Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving [11.312620949473938]
We present an end-to-end method for object detection and trajectory prediction utilizing multi-view representations of LiDAR and camera images.
Our model builds on a state-of-the-art Bird's-Eye View (BEV) network that fuses voxelized features from a sequence of historical LiDAR data.
We extend this model with additional LiDAR Range-View (RV) features that use the raw LiDAR information in its native, non-quantized representation.
arXiv Detail & Related papers (2020-08-27T03:32:25Z)