ZOPP: A Framework of Zero-shot Offboard Panoptic Perception for Autonomous Driving
- URL: http://arxiv.org/abs/2411.05311v1
- Date: Fri, 08 Nov 2024 03:52:32 GMT
- Title: ZOPP: A Framework of Zero-shot Offboard Panoptic Perception for Autonomous Driving
- Authors: Tao Ma, Hongbin Zhou, Qiusheng Huang, Xuemeng Yang, Jianfei Guo, Bo Zhang, Min Dou, Yu Qiao, Botian Shi, Hongsheng Li
- Abstract summary: Offboard perception aims to automatically generate high-quality 3D labels for autonomous driving scenes.
We propose a novel Zero-shot Offboard Panoptic Perception (ZOPP) framework for autonomous driving scenes.
ZOPP integrates the powerful zero-shot recognition capabilities of vision foundation models with 3D representations derived from point clouds.
- Score: 44.174489160967056
- Abstract: Offboard perception aims to automatically generate high-quality 3D labels for autonomous driving (AD) scenes. Existing offboard methods focus on 3D object detection with a closed-set taxonomy and fail to match human-level recognition capability on rapidly evolving perception tasks. Owing to the heavy reliance on human labels and the prevalence of data imbalance and sparsity, a unified offboard auto-labeling framework that covers the various elements of AD scenes and meets the distinct needs of different perception tasks has not been fully explored. In this paper, we propose a novel multi-modal Zero-shot Offboard Panoptic Perception (ZOPP) framework for autonomous driving scenes. ZOPP integrates the powerful zero-shot recognition capabilities of vision foundation models with 3D representations derived from point clouds. To the best of our knowledge, ZOPP represents a pioneering effort in multi-modal panoptic perception and auto-labeling for autonomous driving scenes. We conduct comprehensive empirical studies and evaluations on the Waymo Open Dataset to validate ZOPP on various perception tasks. To further explore the usability and extensibility of ZOPP, we also conduct experiments on downstream applications. The results demonstrate the strong potential of ZOPP for real-world scenarios.
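As the abstract describes, the core idea is to lift zero-shot 2D recognition from vision foundation models onto LiDAR point clouds. The sketch below shows one plausible form of that 2D-to-3D lifting step, assuming a calibrated pinhole camera; all function and variable names are hypothetical, and this illustrates the general technique rather than the authors' implementation.

```python
# Minimal sketch (not the authors' code): assigning zero-shot 2D instance
# masks to LiDAR points via pinhole projection. Real pipelines must also
# handle multi-camera rigs, occlusion, and temporal tracking.
import numpy as np

def lift_masks_to_points(points_cam, K, instance_masks):
    """points_cam: (N, 3) LiDAR points in the camera frame.
    K: (3, 3) camera intrinsic matrix.
    instance_masks: (M, H, W) boolean masks from a vision foundation model.
    Returns one array of point indices per instance."""
    H, W = instance_masks.shape[1:]
    front = points_cam[:, 2] > 0            # keep points in front of the camera
    idx = np.flatnonzero(front)
    uvw = (K @ points_cam[front].T).T       # project onto the image plane
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    in_img = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    per_instance = []
    for mask in instance_masks:
        hit = np.zeros(front.sum(), dtype=bool)
        hit[in_img] = mask[v[in_img], u[in_img]]
        per_instance.append(idx[hit])       # indices into the original cloud
    return per_instance
```

Points that fall inside a mask inherit that instance's zero-shot label; the per-instance point sets can then be turned into 3D boxes or panoptic segments for auto-labeling.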
Related papers
- A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions [11.071271817366739]
3D object perception has become a crucial component in the development of autonomous driving systems.
This review extensively summarizes traditional 3D object detection methods, focusing on camera-based, LiDAR-based, and fusion detection techniques.
We discuss future directions, including methods to improve accuracy such as temporal perception, occupancy grids, and end-to-end learning frameworks.
arXiv Detail & Related papers (2024-08-28T01:08:33Z)
- 3D Object Visibility Prediction in Autonomous Driving [6.802572869909114]
We present a novel attribute and its corresponding algorithm: 3D object visibility.
The proposed attribute and its computation strategy aim to expand the capabilities of downstream tasks.
arXiv Detail & Related papers (2024-03-06T13:07:42Z)
- Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving [39.70689418558153]
We present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels.
Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants.
arXiv Detail & Related papers (2023-09-25T19:33:52Z)
- A Simple Framework for 3D Occupancy Estimation in Autonomous Driving [16.605853706182696]
We present a CNN-based framework designed to reveal several key factors for 3D occupancy estimation.
We also explore the relationship between 3D occupancy estimation and other related tasks, such as monocular depth estimation and 3D reconstruction.
arXiv Detail & Related papers (2023-03-17T15:57:14Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable an autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages.
Our method makes efficient use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
- 3D Object Detection for Autonomous Driving: A Comprehensive Survey [48.30753402458884]
3D object detection, which intelligently predicts the locations, sizes, and categories of the critical 3D objects near an autonomous vehicle, is an important part of a perception system.
This paper reviews the advances in 3D object detection for autonomous driving.
arXiv Detail & Related papers (2022-06-19T19:43:11Z)
- A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z)
- SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse information from multiple views (a minimal sketch of this fusion idea follows this list).
In experiments, our method achieves state-of-the-art performance on challenging multi-camera depth estimation benchmarks.
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
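For the cross-view transformer mentioned in the SurroundDepth entry above, the following is a hedged PyTorch sketch, not the paper's actual architecture: each camera's feature map is flattened into tokens, and every view attends to the tokens of all surrounding views. All module and parameter names are assumptions.

```python
# Minimal sketch (assumed, not SurroundDepth's code): cross-view attention
# that fuses per-camera feature maps so each pixel's features are informed
# by every surrounding view.
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (B, V, C, H, W) feature maps from V surrounding cameras.
        B, V, C, H, W = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2)  # (B, V, H*W, C)
        all_tokens = tokens.reshape(B, V * H * W, C)   # pool tokens across views
        fused, _ = self.attn(all_tokens, all_tokens, all_tokens)
        fused = self.norm(fused + all_tokens)          # residual connection
        return fused.reshape(B, V, H, W, C).permute(0, 1, 4, 2, 3)

# Usage: fuse features from 6 cameras at a coarse resolution.
fusion = CrossViewFusion(dim=256)
x = torch.randn(2, 6, 256, 12, 20)
y = fusion(x)  # (2, 6, 256, 12, 20), each view informed by all others
```

Joint attention over all views lets overlap and occlusion cues visible in one camera inform depth prediction in its neighbors, which is the motivation the summary gives for fusing information across views.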