LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception
- URL: http://arxiv.org/abs/2303.12194v2
- Date: Sat, 2 Mar 2024 22:18:12 GMT
- Title: LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception
- Authors: Zixiang Zhou, Dongqiangzi Ye, Weijia Chen, Yufei Xie, Yu Wang, Panqu Wang, Hassan Foroosh
- Abstract summary: We introduce a new LiDAR multi-task learning paradigm based on the transformer.
LiDARFormer exploits cross-task synergy to boost the performance of LiDAR perception tasks.
LiDARFormer is evaluated on the large-scale nuScenes and Waymo Open datasets for both 3D detection and semantic segmentation tasks.
- Score: 15.919789515451615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a recent trend in the LiDAR perception field towards unifying
multiple tasks in a single strong network with improved performance, as opposed
to using separate networks for each task. In this paper, we introduce a new
LiDAR multi-task learning paradigm based on the transformer. The proposed
LiDARFormer utilizes cross-space global contextual feature information and
exploits cross-task synergy to boost the performance of LiDAR perception tasks
across multiple large-scale datasets and benchmarks. Our novel
transformer-based framework includes a cross-space transformer module that
learns attentive features between the 2D dense Bird's Eye View (BEV) and 3D
sparse voxel feature maps. Additionally, we propose a transformer decoder for
the segmentation task to dynamically adjust the learned features by leveraging
the categorical feature representations. Furthermore, we combine the
segmentation and detection features in a shared transformer decoder with
cross-task attention layers to enhance and integrate the object-level and
class-level features. LiDARFormer is evaluated on the large-scale nuScenes and
the Waymo Open datasets for both 3D detection and semantic segmentation tasks,
and it outperforms all previously published methods on both tasks. Notably,
LiDARFormer achieves state-of-the-art performance of 76.4% L2 mAPH and
74.3% NDS on the challenging Waymo and nuScenes detection benchmarks for a
single-model, LiDAR-only method.
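
To make the architecture description concrete, the sketch below illustrates the two mechanisms the abstract names, in PyTorch: cross-space attention, where sparse voxel features attend to the dense BEV feature map, and a shared decoder layer with cross-task attention between object-level detection queries and class-level segmentation queries. All module names, dimensions, and the use of standard multi-head attention blocks are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal, illustrative sketch (not the authors' code) of the two ideas in the
# abstract: (1) cross-space attention between sparse voxel features and a dense
# BEV map, and (2) a shared decoder with cross-task attention between detection
# (object-level) and segmentation (class-level) queries. Names and dimensions
# are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossSpaceAttention(nn.Module):
    """Hypothetical stand-in for the cross-space transformer module:
    sparse 3D voxel features attend to the flattened 2D BEV feature map."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats: torch.Tensor, bev_feats: torch.Tensor):
        # voxel_feats: (B, N_voxels, C) sparse voxel features (queries)
        # bev_feats:   (B, H*W, C)      flattened dense BEV map (keys/values)
        out, _ = self.attn(voxel_feats, bev_feats, bev_feats)
        return self.norm(voxel_feats + out)  # residual connection


class CrossTaskDecoderLayer(nn.Module):
    """Hypothetical shared decoder layer: detection object queries and
    segmentation class queries attend to each other (cross-task attention)
    before attending to the shared BEV features."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.cross_task = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_feat = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, det_q, seg_q, bev_feats):
        # det_q: (B, N_obj, C) object-level queries
        # seg_q: (B, N_cls, C) class-level (categorical) queries
        q = torch.cat([det_q, seg_q], dim=1)
        q = self.norm1(q + self.cross_task(q, q, q)[0])
        q = self.norm2(q + self.cross_feat(q, bev_feats, bev_feats)[0])
        return q.split([det_q.size(1), seg_q.size(1)], dim=1)


if __name__ == "__main__":
    B, C = 2, 128
    voxels, bev = torch.randn(B, 500, C), torch.randn(B, 64 * 64, C)
    det_q, seg_q = torch.randn(B, 100, C), torch.randn(B, 16, C)
    fused = CrossSpaceAttention(C)(voxels, bev)
    det_out, seg_out = CrossTaskDecoderLayer(C)(det_q, seg_q, bev)
    print(fused.shape, det_out.shape, seg_out.shape)
```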
Related papers
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages vision foundation models (VFMs) to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in linear probing and fine-tuning for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
- LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes [55.33167217384738]
LiMoE is a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning.
Our approach consists of three stages: Image-to-LiDAR Pretraining, Contrastive Mixture Learning (CML), and Semantic Mixture Supervision (SMS).
arXiv Detail & Related papers (2025-01-07T18:59:58Z)
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new remote sensing task, Multi-Modal Datasets and Multi-Task Object Detection (M2Det).
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- A Point-Based Approach to Efficient LiDAR Multi-Task Perception [49.91741677556553]
PAttFormer is an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds.
Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for task-specific point cloud representations.
Our evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP.
arXiv Detail & Related papers (2024-04-19T11:24:34Z)
- Small, Versatile and Mighty: A Range-View Perception Framework [13.85089181673372]
We propose a novel multi-task framework for 3D detection on LiDAR data.
Our framework integrates semantic segmentation and panoptic segmentation for the LiDAR point cloud.
Among range-view-based methods, our model achieves new state-of-the-art detection performance on the Waymo Open dataset.
arXiv Detail & Related papers (2024-03-01T07:02:42Z)
- LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving [12.713417063678335]
We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantic segmentation, and motion segmentation.
We propose a novel Semantic Weighting and Guidance (SWAG) module to selectively transfer semantic features for improved object detection.
We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection.
arXiv Detail & Related papers (2023-07-17T21:22:17Z)
- AOP-Net: All-in-One Perception Network for Joint LiDAR-based 3D Object Detection and Panoptic Segmentation [9.513467995188634]
AOP-Net is a LiDAR-based multi-task framework that combines 3D object detection and panoptic segmentation.
AOP-Net achieves state-of-the-art performance among published works on the nuScenes benchmark for both 3D object detection and panoptic segmentation tasks.
arXiv Detail & Related papers (2023-02-02T05:31:53Z)
- LidarMultiNet: Towards a Unified Multi-Task Network for LiDAR Perception [15.785527155108966]
LidarMultiNet is a LiDAR-based multi-task network that unifies 3D object detection, semantic segmentation, and panoptic segmentation.
At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder architecture with a Global Context Pooling (GCP) module.
LidarMultiNet is extensively tested on both the Waymo Open Dataset and the nuScenes dataset.
arXiv Detail & Related papers (2022-09-19T23:39:15Z)
- Boosting 3D Object Detection by Simulating Multimodality on Point Clouds [51.87740119160152]
This paper presents a new approach to boost a single-modality (LiDAR) 3D object detector by teaching it to simulate features and responses that follow a multi-modality (LiDAR-image) detector.
The approach needs LiDAR-image data only when training the single-modality detector; once well-trained, it needs only LiDAR data at inference (a minimal sketch of this setup follows the list below).
Experimental results on the nuScenes dataset show that our approach outperforms all SOTA LiDAR-only 3D detectors.
arXiv Detail & Related papers (2022-06-30T01:44:30Z)
- The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all monocular 3D object detectors on the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)
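
For the multimodality-simulation entry above ("Boosting 3D Object Detection by Simulating Multimodality on Point Clouds"), here is a minimal, hypothetical sketch of the teacher-student setup it describes: a frozen LiDAR-image teacher supervises a LiDAR-only student through a feature-imitation loss, so that inference requires LiDAR alone. Tensor shapes, module choices, and the MSE imitation loss are assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions only) of cross-modality feature distillation:
# a LiDAR-only student is trained to mimic the intermediate features of a
# frozen LiDAR-image teacher; at inference time only LiDAR input is needed.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_step(student: nn.Module,
                      teacher: nn.Module,
                      lidar_bev: torch.Tensor,
                      image_feats: torch.Tensor,
                      det_loss: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """One training step's loss: detection loss + feature-imitation loss.
    Shapes and the MSE imitation term are hypothetical choices."""
    with torch.no_grad():  # the multi-modality teacher is frozen
        target = teacher(torch.cat([lidar_bev, image_feats], dim=1))
    pred = student(lidar_bev)  # the student sees LiDAR features only
    imitation = F.mse_loss(pred, target)
    return det_loss + beta * imitation


if __name__ == "__main__":
    C = 64
    # Toy stand-ins for real detectors: 1x1 convs over BEV feature maps.
    student = nn.Conv2d(C, C, 1)
    teacher = nn.Conv2d(2 * C, C, 1)
    lidar = torch.randn(2, C, 32, 32)
    image = torch.randn(2, C, 32, 32)
    loss = distillation_step(student, teacher, lidar, image,
                             det_loss=torch.tensor(0.5))
    loss.backward()
    print(float(loss))
```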