Open-Vocabulary 3D Detection via Image-level Class and Debiased
Cross-modal Contrastive Learning
- URL: http://arxiv.org/abs/2207.01987v1
- Date: Tue, 5 Jul 2022 12:13:52 GMT
- Authors: Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka,
Kurt Keutzer, Shanghang Zhang
- Abstract summary: Current point-cloud detection methods have difficulty detecting open-vocabulary objects in the real world.
We propose OV-3DETIC, an Open-Vocabulary 3D DETector using Image-level Class supervision.
- Score: 62.18197846270103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current point-cloud detection methods have difficulty detecting
open-vocabulary objects in the real world due to their limited generalization
capability. Moreover, it is extremely laborious and expensive to collect and
fully annotate a point-cloud detection dataset with numerous classes of
objects; as a result, existing point-cloud datasets cover only a limited set of
classes, which hinders models from learning the general representations needed
for open-vocabulary point-cloud detection. To the best of our knowledge, we are
the first to study the problem
of open-vocabulary 3D point-cloud detection. Instead of seeking a point-cloud
dataset with full labels, we resort to ImageNet1K to broaden the vocabulary of
the point-cloud detector. We propose OV-3DETIC, an Open-Vocabulary 3D DETector
using Image-level Class supervision. Specifically, we take advantage of two
modalities, the image modality for recognition and the point-cloud modality for
localization, to generate pseudo labels for unseen classes. Then we propose a
novel debiased cross-modal contrastive learning method to transfer the
knowledge from the image modality to the point-cloud modality during training.
Without increasing inference latency, OV-3DETIC enables the point-cloud
detector to perform open-vocabulary detection. Extensive experiments
demonstrate that the proposed OV-3DETIC achieves at least a 10.77% absolute
mAP improvement on the SUN-RGBD dataset and a 9.56% absolute mAP improvement
on the ScanNet dataset over a wide range of baselines. In addition, we conduct
further experiments to shed light on why the proposed OV-3DETIC works.
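The debiased cross-modal contrastive idea described in the abstract can be sketched as follows. This is an illustrative toy in pure Python, not the authors' exact formulation: it pulls each image feature toward its paired point-cloud feature with an InfoNCE-style loss, and corrects the negative term for likely false negatives following the debiased contrastive estimator of Chuang et al. (2020). The `tau_plus` parameter and all function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (assumed nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def debiased_cross_modal_nce(img_feats, pc_feats, tau=0.1, tau_plus=0.1):
    """Illustrative debiased InfoNCE between paired image / point-cloud
    features.

    Image feature i is pulled toward its paired point-cloud feature i and
    pushed from the other point-cloud features; the negative term subtracts
    the expected contribution of false negatives (controlled by tau_plus,
    the assumed chance a 'negative' shares the anchor's class), clamped for
    numerical stability.
    """
    n = len(img_feats)
    losses = []
    for i in range(n):
        pos = math.exp(cosine(img_feats[i], pc_feats[i]) / tau)
        negs = [math.exp(cosine(img_feats[i], pc_feats[j]) / tau)
                for j in range(n) if j != i]
        m = len(negs)
        # Debiased negative estimate, clamped to a small positive floor.
        neg_sum = max((sum(negs) - m * tau_plus * pos) / (1.0 - tau_plus),
                      m * math.exp(-1.0 / tau))
        losses.append(-math.log(pos / (pos + neg_sum)))
    return sum(losses) / n
```

With correctly paired features the loss is near zero; shuffling the point-cloud side raises it, which is the signal that drives the cross-modal knowledge transfer during training.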
Related papers
- Open-Vocabulary Point-Cloud Object Detection without 3D Annotation [62.18197846270103]
The goal of open-vocabulary 3D point-cloud detection is to identify novel objects based on arbitrary textual descriptions.
We develop a point-cloud detector that can learn a general representation for localizing various objects.
We also propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text.
arXiv Detail & Related papers (2023-04-03T08:22:02Z) - Exploring Active 3D Object Detection from a Generalization Perspective [58.597942380989245]
Uncertainty-based active learning policies fail to balance the trade-off between point cloud informativeness and box-level annotation costs.
We propose CRB, which hierarchically filters out point clouds whose 3D bounding-box labels would be redundant.
Experiments show that the proposed approach outperforms existing active learning strategies.
arXiv Detail & Related papers (2023-01-23T02:43:03Z) - Unleash the Potential of Image Branch for Cross-modal 3D Object
Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch in two respects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z) - Data Augmentation-free Unsupervised Learning for 3D Point Cloud
Understanding [61.30276576646909]
We propose an augmentation-free unsupervised approach for point clouds to learn transferable point-level features via soft clustering, named SoftClu.
We exploit the affiliation of points to their clusters as a proxy to enable self-training through a pseudo-label prediction task.
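The soft-clustering pseudo-label idea in this blurb can be sketched as follows. This is an illustrative stand-in, not SoftClu's exact formulation: each point's cluster affiliation is a softmax over negative squared distances to cluster centroids, and the argmax of that affiliation serves as a pseudo-label for self-training. The temperature parameter and function names are assumptions.

```python
import math

def soft_assignments(points, centroids, temperature=1.0):
    """Soft cluster affiliation per point: softmax over negative squared
    distances to the centroids (an illustrative stand-in for SoftClu's
    soft clustering)."""
    assigns = []
    for p in points:
        logits = [-sum((a - b) ** 2 for a, b in zip(p, c)) / temperature
                  for c in centroids]
        mx = max(logits)                      # subtract max for stability
        exps = [math.exp(l - mx) for l in logits]
        z = sum(exps)
        assigns.append([e / z for e in exps])
    return assigns

def pseudo_labels(assigns):
    """Hard pseudo-label = argmax of the soft affiliation; a point-level
    prediction head can then be self-trained to reproduce these labels."""
    return [max(range(len(a)), key=a.__getitem__) for a in assigns]
```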
arXiv Detail & Related papers (2022-10-06T10:18:16Z) - Boosting 3D Object Detection by Simulating Multimodality on Point Clouds [51.87740119160152]
This paper presents a new approach to boost a single-modality (LiDAR) 3D object detector by teaching it to simulate features and responses that follow a multi-modality (LiDAR-image) detector.
The approach needs LiDAR-image data only when training the single-modality detector, and once well-trained, it only needs LiDAR data at inference.
Experimental results on the nuScenes dataset show that our approach outperforms all SOTA LiDAR-only 3D detectors.
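The teacher-student scheme described in this entry can be sketched as a feature-imitation loss: the LiDAR-only student is trained to reproduce the multimodal (LiDAR-image) teacher's features, so only LiDAR is needed at inference. This is a minimal MSE sketch assuming aligned teacher and student feature maps, not the paper's exact objective (which also matches detector responses).

```python
def feature_imitation_loss(student_feats, teacher_feats):
    """Mean-squared error between the LiDAR-only student's features and the
    multimodal teacher's features; minimizing it teaches the student to
    simulate the teacher without needing images at inference."""
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count
```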
arXiv Detail & Related papers (2022-06-30T01:44:30Z) - SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
arXiv Detail & Related papers (2021-01-07T18:30:32Z) - R-AGNO-RPN: A LIDAR-Camera Region Deep Network for Resolution-Agnostic
Detection [3.4761212729163313]
We propose R-AGNO-RPN, a region proposal network built on the fusion of 3D point clouds and RGB images.
Our approach is also designed to work with low-resolution point clouds.
arXiv Detail & Related papers (2020-12-10T15:22:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.