Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving
- URL: http://arxiv.org/abs/2305.15765v1
- Date: Thu, 25 May 2023 06:22:10 GMT
- Title: Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving
- Authors: Wenhao Cheng, Junbo Yin, Wei Li, Ruigang Yang and Jianbing Shen
- Abstract summary: We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly trains the LiDAR-based object detector with language features and predicts the targeted region directly from the detector.
Our work offers deeper insight into the LiDAR-based grounding task, and we expect it to open a promising direction for the autonomous driving community.
- Score: 91.91552963872596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of 3D referring expression comprehension
(REC) in autonomous driving scenarios, which aims to ground a natural language
expression to the targeted region in LiDAR point clouds. Previous approaches to
REC usually focus on the 2D or 3D-indoor domain and are thus ill-suited to
accurately predicting the location of the queried 3D region in an autonomous
driving scene. In addition, the upper-bound limitation and heavy computation
cost of these approaches motivate us to explore a better solution. In this work, we
propose a new multi-modal visual grounding task, termed LiDAR Grounding. Then
we devise a Multi-modal Single Shot Grounding (MSSG) approach with an effective
token fusion strategy. It jointly trains the LiDAR-based object detector with
the language features and predicts the targeted region directly from the
detector without any post-processing. Moreover, image features can be
flexibly integrated into our approach to provide rich texture and color
information. The cross-modal learning forces the detector to concentrate on
important regions in the point cloud by attending to the informative language
expression, thus leading to much better accuracy and efficiency. Extensive
experiments on the Talk2Car dataset demonstrate the effectiveness of the
proposed methods. Our work offers deeper insight into the LiDAR-based
grounding task, and we expect it to open a promising direction for the
autonomous driving community.
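The abstract does not spell out the fusion architecture, so the following is only a minimal sketch of the single-shot idea it describes: language tokens are fused into dense BEV detector features via cross-attention, and boxes plus a grounding score are predicted in one pass without post-processing. All module choices, names, and dimensions here are illustrative assumptions, not the authors' MSSG implementation.

```python
import torch
import torch.nn as nn

class TokenFusionGrounder(nn.Module):
    """Toy single-shot grounder: BEV cells cross-attend to language tokens."""

    def __init__(self, d_model=128, vocab_size=1000, n_heads=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)        # toy text encoder
        self.text_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.bev_stem = nn.Conv2d(64, d_model, 3, padding=1)       # stand-in LiDAR BEV backbone
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Conv2d(d_model, 7 + 1, 1)                   # (x,y,z,w,l,h,yaw) + score

    def forward(self, bev, tokens):
        # bev: (B, 64, H, W) pseudo-image from pillar/voxel features; tokens: (B, T) ids
        txt = self.text_enc(self.text_embed(tokens))               # (B, T, D) language tokens
        feat = self.bev_stem(bev)                                  # (B, D, H, W)
        b, d, h, w = feat.shape
        q = feat.flatten(2).transpose(1, 2)                        # BEV cells as queries (B, HW, D)
        fused, _ = self.cross_attn(q, txt, txt)                    # every cell attends to the text
        fused = (q + fused).transpose(1, 2).reshape(b, d, h, w)
        out = self.head(fused)                                     # dense one-shot predictions
        return out[:, :7], out[:, 7]                               # boxes + grounding score; the
                                                                   # top cell is read off, no NMS

model = TokenFusionGrounder()
boxes, score = model(torch.randn(2, 64, 32, 32), torch.randint(0, 1000, (2, 12)))
print(boxes.shape, score.shape)  # torch.Size([2, 7, 32, 32]) torch.Size([2, 32, 32])
```

The point of the sketch is the single-shot property: because each BEV cell is conditioned on the expression before the detection head runs, the highest-scoring cell can be taken directly, with no text-agnostic proposal stage to post-filter.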
Related papers
- Open 3D World in Autonomous Driving [6.876824330759794]
This paper presents a novel approach that integrates 3D point cloud data, acquired from LiDAR sensors, with textual information.
We introduce an efficient framework for the fusion of bird's-eye view (BEV) region features with textual features.
The effectiveness of the proposed methodology is rigorously evaluated through extensive experimentation on the newly introduced NuScenes-T dataset.
arXiv Detail & Related papers (2024-08-20T14:10:44Z)
- OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
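The summary does not give OccNeRF's parameterization formula. One widely used way to keep samples bounded under a camera's unbounded perceptive range is a Mip-NeRF 360 style scene contraction; the sketch below illustrates that general idea and is an assumption, not OccNeRF's exact scheme.

```python
# Hedged sketch: map all of R^3 into a ball of radius 2 so ray samples at
# arbitrarily large depth still land in a bounded, regularly sampled volume.
import torch

def contract(x: torch.Tensor) -> torch.Tensor:
    """Contract points x of shape (..., 3) into the radius-2 ball."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    # identity inside the unit ball, smooth 1/r compression outside it
    return torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * x / norm)

pts = torch.tensor([[0.3, 0.0, 0.0], [10.0, 0.0, 0.0], [1e6, 0.0, 0.0]])
print(contract(pts))
# near points pass through unchanged; far points approach radius 2 but never reach it
```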
arXiv Detail & Related papers (2023-12-14T18:58:52Z)
- ODM3D: Alleviating Foreground Sparsity for Semi-Supervised Monocular 3D Object Detection [15.204935788297226]
The ODM3D framework entails cross-modal knowledge distillation at various levels to inject LiDAR-domain knowledge into a monocular detector during training.
By identifying foreground sparsity as the main culprit behind existing methods' suboptimal training, we exploit the precise localisation information embedded in LiDAR points.
Our method ranks 1st in both KITTI validation and test benchmarks, significantly surpassing all existing monocular methods, supervised or semi-supervised.
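As a rough illustration of cross-modal distillation under foreground sparsity, the hypothetical sketch below matches a monocular student's BEV features to a frozen LiDAR teacher's, upweighting foreground cells. The loss form and weighting are assumptions for illustration, not ODM3D's exact recipe.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_bev, teacher_bev, fg_mask, fg_weight=5.0):
    # student_bev, teacher_bev: (B, C, H, W); fg_mask: (B, H, W) in {0, 1}
    per_cell = F.mse_loss(student_bev, teacher_bev.detach(), reduction="none").mean(1)
    weight = 1.0 + (fg_weight - 1.0) * fg_mask           # emphasize the sparse foreground
    return (weight * per_cell).sum() / weight.sum()

student = torch.randn(2, 64, 32, 32, requires_grad=True)
teacher = torch.randn(2, 64, 32, 32)                     # stand-in LiDAR-branch features
mask = (torch.rand(2, 32, 32) > 0.9).float()             # ~10% of BEV cells are foreground
print(distill_loss(student, teacher, mask).item())
```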
arXiv Detail & Related papers (2023-10-28T07:12:09Z)
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
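The summary names the normalized local coordinate (NLC) map task without defining it. One plausible construction of such regression targets, expressing each object point in its box's canonical frame scaled to [0, 1], is sketched below; UPIDet's exact definition may differ.

```python
import numpy as np

def nlc_targets(points, center, size, yaw):
    """points: (N, 3); box center: (3,); size: (l, w, h); yaw: rotation about z."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # world -> box frame
    local = (points - center) @ rot.T         # offsets expressed in the canonical frame
    return local / np.asarray(size) + 0.5     # box corners map to 0 and 1 per axis

pts = np.array([[2.0, 1.0, 0.5], [1.0, 1.0, 0.0]])
print(nlc_targets(pts, center=np.array([1.0, 1.0, 0.0]),
                  size=(4.0, 2.0, 1.5), yaw=0.0))
# the box center itself maps to (0.5, 0.5, 0.5)
```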
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
- PillarGrid: Deep Learning-based Cooperative Perception for 3D Object Detection from Onboard-Roadside LiDAR [15.195933965761645]
We propose PillarGrid, a novel cooperative perception method fusing information from multiple 3D LiDARs.
PillarGrid consists of four main phases: 1) cooperative preprocessing of point clouds, 2) pillar-wise voxelization and feature extraction, 3) grid-wise deep fusion of features from multiple sensors, and 4) convolutional neural network (CNN)-based augmented 3D object detection.
Extensive experimentation shows that PillarGrid outperforms the SOTA single-LiDAR-based 3D object detection methods with respect to both accuracy and range by a large margin.
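As a toy illustration of phase 2 (pillar-wise voxelization), the sketch below scatters a point cloud into a BEV grid of pillars with a max-height feature per cell. Grid extents, cell size, and the feature choice are assumptions, not PillarGrid's configuration.

```python
import numpy as np

def pillarize(points, x_range=(0, 40), y_range=(-20, 20), cell=0.5):
    """points: (N, 3) x/y/z; returns a (nx, ny) BEV grid of max heights."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((nx, ny), dtype=np.float32)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for i, j, z in zip(ix[ok], iy[ok], points[ok, 2]):
        bev[i, j] = max(bev[i, j], z)        # max height as a toy pillar feature
    return bev

cloud = np.random.uniform([0, -20, -2], [40, 20, 2], size=(5000, 3))
print(pillarize(cloud).shape)                # (80, 80) BEV grid for one sensor
# phase 3 (grid-wise fusion) could then combine the onboard and roadside BEV
# grids, e.g. element-wise, once both are aligned to a shared coordinate frame
```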
arXiv Detail & Related papers (2022-03-12T02:28:41Z)
- The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all monocular 3D object detectors on the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)
- LiDAR-based Panoptic Segmentation via Dynamic Shifting Network [56.71765153629892]
LiDAR-based panoptic segmentation aims to parse both objects and scenes in a unified manner.
We propose the Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm.
Our proposed DS-Net achieves superior accuracies over current state-of-the-art methods.
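The summary leaves "dynamic shifting" undefined; the intuition resembles a mean-shift update that iteratively moves per-point regressed instance centers toward density peaks. The fixed-bandwidth sketch below is a simplification of that idea, since DS-Net additionally learns to weight several kernel bandwidths.

```python
import torch

def shift_centers(centers, bandwidth=1.0, iters=4):
    # centers: (N, 3) per-point regressed instance centers
    for _ in range(iters):
        d2 = torch.cdist(centers, centers) ** 2               # pairwise squared distances
        k = torch.exp(-d2 / (2 * bandwidth ** 2))             # Gaussian kernel affinities
        centers = (k @ centers) / k.sum(dim=1, keepdim=True)  # kernel-weighted mean shift
    return centers  # centers of one instance collapse toward a common peak

noisy = torch.randn(50, 3) * 0.2 + torch.tensor([5.0, 0.0, 0.0])
print(shift_centers(noisy).std(dim=0))  # spread shrinks after shifting
```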
arXiv Detail & Related papers (2020-11-24T08:44:46Z)
- InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z)
- SegVoxelNet: Exploring Semantic Context and Depth-aware Features for 3D Vehicle Detection from Point Cloud [39.99118618229583]
We propose a unified model, SegVoxelNet, to address the above two problems.
A semantic context encoder is proposed to leverage the free-of-charge semantic segmentation masks in the bird's eye view.
A novel depth-aware head is designed to explicitly model the distribution differences and each part of the depth-aware head is made to focus on its own target detection range.
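As a minimal illustration of a depth-aware head, the sketch below splits the BEV feature map into near, middle, and far bands along the depth axis and gives each band its own detection head, so each head specializes on one distance regime. The band count and head design are assumptions rather than SegVoxelNet's actual architecture.

```python
import torch
import torch.nn as nn

class DepthAwareHead(nn.Module):
    def __init__(self, channels=64, num_bands=3, outputs=8):
        super().__init__()
        self.num_bands = num_bands
        self.heads = nn.ModuleList(
            nn.Conv2d(channels, outputs, 1) for _ in range(num_bands)
        )

    def forward(self, bev):
        # bev: (B, C, H, W) with H indexing depth (near rows through far rows)
        bands = torch.chunk(bev, self.num_bands, dim=2)   # split along the depth axis
        return torch.cat([head(b) for head, b in zip(self.heads, bands)], dim=2)

head = DepthAwareHead()
print(head(torch.randn(2, 64, 96, 96)).shape)  # torch.Size([2, 8, 96, 96])
```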
arXiv Detail & Related papers (2020-02-13T02:42:31Z)