Unsupervised 3D Perception with 2D Vision-Language Distillation for
Autonomous Driving
- URL: http://arxiv.org/abs/2309.14491v1
- Date: Mon, 25 Sep 2023 19:33:52 GMT
- Title: Unsupervised 3D Perception with 2D Vision-Language Distillation for
Autonomous Driving
- Authors: Mahyar Najibi, Jingwei Ji, Yin Zhou, Charles R. Qi, Xinchen Yan, Scott
Ettinger, Dragomir Anguelov
- Abstract summary: We present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels.
Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Closed-set 3D perception models trained on only a pre-defined set of object
categories can be inadequate for safety-critical applications such as
autonomous driving where new object types can be encountered after deployment.
In this paper, we present a multi-modal auto labeling pipeline capable of
generating amodal 3D bounding boxes and tracklets for training models on
open-set categories without 3D human labels. Our pipeline exploits motion cues
inherent in point cloud sequences in combination with the freely available 2D
image-text pairs to identify and track all traffic participants. Compared to
recent studies in this domain, which can only provide class-agnostic auto
labels limited to moving objects, our method handles both static and moving
objects in an unsupervised manner and outputs open-vocabulary semantic
labels thanks to the proposed vision-language knowledge distillation.
Experiments on the Waymo Open Dataset show that our approach outperforms the
prior work by significant margins on various unsupervised 3D perception tasks.
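The proposed vision-language knowledge distillation can be pictured as training a 3D feature extractor to mimic a 2D vision-language model's image embeddings, so that open-vocabulary labels can later be assigned by text similarity. The following minimal sketch (toy embeddings and invented function names; not the paper's code) illustrates the idea:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def distillation_loss(feat_3d, feat_2d):
    """Cosine-distance loss pulling 3D object features toward the
    2D vision-language (teacher) embeddings of the same objects."""
    sim = np.sum(l2_normalize(feat_3d) * l2_normalize(feat_2d), axis=-1)
    return float(np.mean(1.0 - sim))

def open_vocab_label(feat_3d, text_embeds, class_names):
    """Open-vocabulary labeling: pick the class whose text embedding
    is most similar to each distilled 3D feature."""
    sims = l2_normalize(feat_3d) @ l2_normalize(text_embeds).T
    return [class_names[i] for i in np.argmax(sims, axis=1)]

# Toy check: with text embeddings aligned to the features, labels follow similarity.
labels = open_vocab_label(np.eye(3), np.eye(3), ["car", "pedestrian", "cyclist"])
print(labels)  # -> ['car', 'pedestrian', 'cyclist']
```

Because labels come from text similarity rather than a fixed classification head, new categories only require new text embeddings, not 3D relabeling.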
Related papers
- Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking [73.05477052645885]
We introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories.
We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes.
arXiv Detail & Related papers (2024-10-02T15:48:42Z) - Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection [16.09503890891102]
We propose an unsupervised 3D detection approach that operates exclusively on LiDAR point clouds.
We exploit the inherent spatio-temporal knowledge of LiDAR point clouds for clustering, tracking, as well as box and label refinement.
Our approach outperforms state-of-the-art unsupervised 3D object detectors on the Waymo Open Dataset.
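A first step such unsupervised LiDAR pipelines share is to strip near-ground points and group the remainder into object clusters by spatial proximity. The following self-contained sketch (toy thresholds and points; not the paper's implementation) shows the idea:

```python
import numpy as np

def cluster_points(points, ground_z=0.2, radius=1.0):
    """Crude ground removal followed by flood-fill clustering by distance."""
    pts = points[points[:, 2] > ground_z]       # drop near-ground points
    labels = -np.ones(len(pts), dtype=int)      # -1 = unassigned
    cluster = 0
    for i in range(len(pts)):
        if labels[i] != -1:
            continue
        labels[i] = cluster
        frontier = [i]
        while frontier:                         # grow the cluster outward
            j = frontier.pop()
            near = np.linalg.norm(pts - pts[j], axis=1) < radius
            for k in np.flatnonzero(near & (labels == -1)):
                labels[k] = cluster
                frontier.append(k)
        cluster += 1
    return pts, labels

ground = np.array([[1.0, 2.0, 0.0], [-3.0, 4.0, 0.05]])
object_a = np.array([[0.0, 0.0, 1.0], [0.3, 0.0, 1.0], [0.3, 0.3, 1.2]])
object_b = object_a + np.array([10.0, 0.0, 0.0])
pts, labels = cluster_points(np.vstack([ground, object_a, object_b]))
print(labels.max() + 1)  # -> 2 clusters found
```

Tracking the resulting clusters across frames then supplies the temporal cues used for box and label refinement.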
arXiv Detail & Related papers (2024-08-07T14:14:53Z) - Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? [66.6886931183372]
We introduce DETR-style 3D perceptrons as 3D tokenizers, which connect LLM with a one-layer linear projector.
Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks.
arXiv Detail & Related papers (2024-05-28T16:57:44Z) - Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance [72.6809373191638]
We propose a framework that leverages constraints between the 2D and 3D domains without requiring any 3D labels.
First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions.
Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations.
Third, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data.
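The output-level constraint can be illustrated as follows: project the corners of a predicted 3D box into the image with the camera intrinsics, take the enclosing rectangle, and measure its 2D overlap (IoU) with the 2D box estimate. The intrinsics and boxes below are toy values, not the paper's setup:

```python
import numpy as np

def box3d_corners(center, size):
    """The 8 corners of an axis-aligned 3D box (center plus half-size offsets)."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.array(center) + 0.5 * signs * np.array(size)

def project(points, K):
    """Pinhole projection of Nx3 camera-frame points to pixel coordinates."""
    uv = (K @ points.T).T
    return uv[:, :2] / uv[:, 2:3]

def iou_2d(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
uv = project(box3d_corners((0.0, 0.0, 10.0), (4.0, 2.0, 1.5)), K)
proj_box = (uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max())
box_2d = (210.0, 185.0, 430.0, 295.0)  # hypothetical 2D detector output
print(round(iou_2d(proj_box, box_2d), 2))  # -> 0.97
```

In training, a term like 1 − IoU would then penalize 3D boxes whose projection drifts away from the 2D evidence.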
arXiv Detail & Related papers (2023-12-12T18:57:25Z) - Every Dataset Counts: Scaling up Monocular 3D Object Detection with Joint Datasets Training [9.272389295055271]
This study investigates the pipeline for training a monocular 3D object detection model on a diverse collection of 3D and 2D datasets.
The proposed framework comprises three components: (1) a robust monocular 3D model capable of functioning across various camera settings, (2) a selective-training strategy to accommodate datasets with differing class annotations, and (3) a pseudo 3D training approach using 2D labels to enhance detection performance in scenes containing only 2D labels.
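Component (2), the selective-training strategy, can be sketched as masking the classification loss so a dataset is never penalized for classes it does not annotate. The class sets and logits below are invented for illustration; this is not the paper's code:

```python
import numpy as np

def masked_cls_loss(logits, targets, annotated_mask):
    """Cross-entropy computed over annotated classes only.

    logits: (N, C) raw scores; targets: (N,) class ids;
    annotated_mask: (C,) bool, True where this dataset labels the class.
    """
    masked = np.where(annotated_mask, logits, -np.inf)  # drop unlabeled classes
    logp = masked - np.log(np.sum(np.exp(masked), axis=1, keepdims=True))
    return float(-np.mean(logp[np.arange(len(targets)), targets]))

logits = np.array([[4.0, 0.5, 3.5], [0.2, 3.0, 2.9]])
targets = np.array([0, 1])
mask_full = np.array([True, True, True])
mask_partial = np.array([True, True, False])  # this dataset never labels class 2
# Masking removes spurious competition from the unlabeled class:
print(masked_cls_loss(logits, targets, mask_partial)
      < masked_cls_loss(logits, targets, mask_full))  # -> True
```

Without the mask, an unlabeled-but-present class would push the loss up and teach the model to suppress detections that are actually correct.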
arXiv Detail & Related papers (2023-10-02T06:17:24Z) - View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection [46.077668660248534]
We propose a novel approach to self-supervise 3D object detection from RGB sequences alone.
Our experiments on KITTI 3D dataset demonstrate performance on par with state-of-the-art self-supervised methods.
arXiv Detail & Related papers (2023-05-29T09:30:39Z) - Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
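The paper learns data association with a fully trainable Neural Message Passing network; as a minimal non-learned stand-in, the sketch below (toy embeddings) scores track/detection pairs by embedding similarity and assigns greedily, where a Hungarian solver would be the optimal choice:

```python
import numpy as np

def associate(track_embeds, det_embeds, min_score=0.5):
    """Greedily match each detection to at most one track by affinity."""
    scores = track_embeds @ det_embeds.T        # pairwise affinity matrix
    matches, used = {}, set()
    # Highest-scoring pairs first; skip anything already matched or too weak.
    for t, d in sorted(np.ndindex(scores.shape), key=lambda td: -scores[td]):
        if t in matches or d in used or scores[t, d] < min_score:
            continue
        matches[t] = d
        used.add(d)
    return matches

tracks = np.array([[1.0, 0.0], [0.0, 1.0]])
dets = np.array([[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]])  # third det starts a new track
print(associate(tracks, dets))  # -> {0: 1, 1: 0}
```

In the learned version, the affinity matrix comes from message passing over a track/detection graph rather than a fixed dot product, which is what makes the association step end-to-end trainable.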
arXiv Detail & Related papers (2021-04-23T17:59:28Z) - Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving.
We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z) - Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels [0.09558392439655011]
Training of 3D object detectors requires datasets with 3D bounding box labels for supervision that have to be generated by hand-labeling.
We propose a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels.
We evaluate the proposed algorithm on the real-world KITTI dataset and achieve promising performance in comparison to state-of-the-art methods requiring 3D bounding box labels for training.
arXiv Detail & Related papers (2020-10-07T16:24:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.