CVTNet: A Cross-View Transformer Network for Place Recognition Using
LiDAR Data
- URL: http://arxiv.org/abs/2302.01665v2
- Date: Fri, 6 Oct 2023 06:26:34 GMT
- Title: CVTNet: A Cross-View Transformer Network for Place Recognition Using
LiDAR Data
- Authors: Junyi Ma, Guangming Xiong, Jingyi Xu, Xieyuanli Chen
- Abstract summary: We propose a cross-view transformer-based network, dubbed CVTNet, to fuse the range image views (RIVs) and bird's eye views (BEVs) generated from the LiDAR data.
We evaluate our approach on three datasets collected with different sensor setups and environmental conditions.
- Score: 15.144590078316252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR-based place recognition (LPR) is one of the most crucial components of
autonomous vehicles to identify previously visited places in GPS-denied
environments. Most existing LPR methods use mundane representations of the
input point cloud without considering different views, which may not fully
exploit the information from LiDAR sensors. In this paper, we propose a
cross-view transformer-based network, dubbed CVTNet, to fuse the range image
views (RIVs) and bird's eye views (BEVs) generated from the LiDAR data. It
extracts correlations within the views themselves using intra-transformers and
between the two different views using inter-transformers. Based on that, our
proposed CVTNet generates a yaw-angle-invariant global descriptor for each
laser scan end-to-end online and retrieves previously seen places by descriptor
matching between the current query scan and the pre-built database. We evaluate
our approach on three datasets collected with different sensor setups and
environmental conditions. The experimental results show that our method
outperforms the state-of-the-art LPR methods with strong robustness to
viewpoint changes and long-time spans. Furthermore, our approach has a good
real-time performance that can run faster than the typical LiDAR frame rate.
The implementation of our method is released as open source at:
https://github.com/BIT-MJY/CVTNet.
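For intuition, below is a minimal PyTorch sketch of the cross-view idea described in the abstract: intra-transformers model correlations within each view (RIV and BEV), an inter-transformer exchanges information between the two views, and pooling over the column (yaw) axis yields a descriptor that is unchanged by circular column shifts, i.e. yaw rotations of the scan, followed by toy descriptor matching against a pre-built database. The module names, feature dimensions, and the max-pooling choice are illustrative assumptions, not the authors' released implementation (see https://github.com/BIT-MJY/CVTNet for that).

    # Minimal sketch of cross-view fusion with intra-/inter-transformers.
    # Assumed names and sizes; not the released CVTNet code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossViewFusion(nn.Module):
        def __init__(self, dim=128, heads=4):
            super().__init__()
            # Intra-view self-attention. No positional encoding is added, so
            # these layers are equivariant to circular shifts of the columns.
            self.intra_riv = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.intra_bev = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            # Inter-view cross-attention: RIV columns attend to BEV columns.
            self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, riv_cols, bev_cols):
            # riv_cols, bev_cols: (B, W, dim) column-wise features of each view.
            riv = self.intra_riv(riv_cols)               # correlations within RIV
            bev = self.intra_bev(bev_cols)               # correlations within BEV
            fused, _ = self.inter(riv, bev, bev)         # correlations across views
            feats = torch.cat([riv, bev, fused], dim=1)  # (B, 3W, dim)
            # Max over the column axis is permutation-invariant, so a yaw
            # rotation (circular column shift) leaves the descriptor unchanged.
            return F.normalize(feats.max(dim=1).values, dim=-1)  # (B, dim)

    # Toy retrieval: match query descriptors against a pre-built database.
    model = CrossViewFusion().eval()
    with torch.no_grad():
        db = model(torch.randn(10, 360, 128), torch.randn(10, 360, 128))  # database scans
        q = model(torch.randn(2, 360, 128), torch.randn(2, 360, 128))     # query scans
        sim = q @ db.T              # cosine similarity (descriptors are unit-norm)
        print(sim.argmax(dim=1))    # indices of the best-matching places

In practice the column features would come from convolutional encoders over the range image and BEV projections, and the database lookup would use a nearest-neighbor index rather than a dense similarity matrix; the sketch only illustrates the fusion and yaw-invariance mechanism.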
Related papers
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose the Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector through simple alignment.
Experimental results demonstrate that the proposed approach outperforms existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- UnLoc: A Universal Localization Method for Autonomous Vehicles using LiDAR, Radar and/or Camera Input [51.150605800173366]
UnLoc is a novel unified neural modeling approach for localization with multi-sensor input in all weather conditions.
Our method is extensively evaluated on Oxford Radar RobotCar, ApolloSouthBay and Perth-WA datasets.
arXiv Detail & Related papers (2023-07-03T04:10:55Z)
- PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer [75.2251801053839]
We present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD).
We propose a Point-Voxel Transformer (PVT) module that obtains long-range context from voxels at low computational cost.
The experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2023-05-11T07:37:15Z)
- SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data [9.32516766412743]
We propose a transformer-based network named SeqOT to exploit the temporal and spatial information provided by sequential range images.
We evaluate our approach on four datasets collected with different types of LiDAR sensors in different environments.
Our method operates online faster than the frame rate of the sensor.
arXiv Detail & Related papers (2022-09-16T14:08:11Z)
- Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images [96.66271207089096]
FCOS-LiDAR is a fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes.
We show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors.
arXiv Detail & Related papers (2022-05-27T05:42:16Z)
- Cycle and Semantic Consistent Adversarial Domain Adaptation for Reducing Simulation-to-Real Domain Shift in LiDAR Bird's Eye View [110.83289076967895]
We present a BEV domain adaptation method based on CycleGAN that uses prior semantic classification in order to preserve the information of small objects of interest during the domain adaptation process.
The quality of the generated BEVs has been evaluated using a state-of-the-art 3D object detection framework on the KITTI 3D Object Detection Benchmark.
arXiv Detail & Related papers (2021-04-22T12:47:37Z)
- MVFuseNet: Improving End-to-End Object Detection and Motion Forecasting through Multi-View Fusion of LiDAR Data [4.8061970432391785]
We propose MVFuseNet, a novel end-to-end method for joint object detection and motion forecasting from a temporal sequence of LiDAR data.
We show the benefits of our multi-view approach for the tasks of detection and motion forecasting on two large-scale self-driving data sets.
arXiv Detail & Related papers (2021-04-21T21:29:08Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
- Self-Supervised Adaptation for Video Super-Resolution [7.26562478548988]
Single-image super-resolution (SISR) networks can adapt their network parameters to specific input images.
We present a new learning algorithm that allows conventional video super-resolution (VSR) networks to adapt their parameters to test video frames.
arXiv Detail & Related papers (2021-03-18T08:30:24Z)
- Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving [11.312620949473938]
We present an end-to-end method for object detection and trajectory prediction utilizing multi-view representations of LiDAR and camera images.
Our model builds on a state-of-the-art Bird's-Eye View (BEV) network that fuses voxelized features from a sequence of historical LiDAR data.
We extend this model with additional LiDAR Range-View (RV) features that use the raw LiDAR information in its native, non-quantized representation.
arXiv Detail & Related papers (2020-08-27T03:32:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.