Open-Vocabulary Point-Cloud Object Detection without 3D Annotation
- URL: http://arxiv.org/abs/2304.00788v2
- Date: Wed, 17 May 2023 02:09:03 GMT
- Title: Open-Vocabulary Point-Cloud Object Detection without 3D Annotation
- Authors: Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka,
Kurt Keutzer, Shanghang Zhang
- Abstract summary: The goal of open-vocabulary 3D point-cloud detection is to identify novel objects based on arbitrary textual descriptions.
We develop a point-cloud detector that can learn a general representation for localizing various objects.
We also propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text.
- Score: 62.18197846270103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of open-vocabulary detection is to identify novel objects based on
arbitrary textual descriptions. In this paper, we address open-vocabulary 3D
point-cloud detection by a dividing-and-conquering strategy, which involves: 1)
developing a point-cloud detector that can learn a general representation for
localizing various objects, and 2) connecting textual and point-cloud
representations to enable the detector to classify novel object categories
based on text prompting. Specifically, we resort to rich image pre-trained
models, by which the point-cloud detector learns localizing objects under the
supervision of predicted 2D bounding boxes from 2D pre-trained detectors.
Moreover, we propose a novel de-biased triplet cross-modal contrastive learning
to connect the modalities of image, point-cloud and text, thereby enabling the
point-cloud detector to benefit from vision-language pre-trained
models, i.e., CLIP. The novel use of image and vision-language pre-trained models
for point-cloud detectors allows for open-vocabulary 3D object detection
without the need for 3D annotations. Experiments demonstrate that the proposed
method improves by at least 3.03 points and 7.47 points over a wide range of
baselines on the ScanNet and SUN RGB-D datasets, respectively. Furthermore, we
provide a comprehensive analysis to explain why our approach works.
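The abstract describes two ingredients: 2D boxes predicted by image-pretrained detectors supervise where the point-cloud detector localizes objects, and a triplet cross-modal contrastive objective ties point-cloud features to CLIP's image and text embeddings. As a rough illustration only, a plain (non-de-biased) triplet InfoNCE loss over the three modalities might look like the sketch below; the function names, shapes, and temperature are assumptions, and the paper's de-biasing of negatives is not reproduced.

    # Illustrative sketch only: a plain triplet InfoNCE loss linking
    # point-cloud, image, and text embeddings (the paper's de-biasing of
    # negatives is not shown here).
    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        """Symmetric InfoNCE between two batches of embeddings of shape (N, D)."""
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature                 # (N, N) similarities
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def triplet_cross_modal_loss(pc_emb, img_emb, txt_emb):
        """pc_emb: point-cloud proposal features; img_emb / txt_emb: CLIP features."""
        return (info_nce(pc_emb, img_emb) +
                info_nce(pc_emb, txt_emb) +
                info_nce(img_emb, txt_emb))

Trained with such a loss on top of the box-supervised localization objective, the point-cloud detector can classify its detections at test time by comparing them against category names embedded with CLIP's text encoder.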
Related papers
- Open Vocabulary Monocular 3D Object Detection [10.424711580213616]
We pioneer the study of open-vocabulary monocular 3D object detection, a novel task that aims to detect and localize objects in 3D space from a single RGB image.
We introduce a class-agnostic approach that leverages open-vocabulary 2D detectors and lifts 2D bounding boxes into 3D space.
Our approach decouples the recognition and localization of objects in 2D from the task of estimating 3D bounding boxes, enabling generalization across unseen categories.
arXiv Detail & Related papers (2024-11-25T18:59:17Z)
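The entry above lifts open-vocabulary 2D detections into 3D space. A minimal sketch of one common way to do this, back-projecting the 2D box corners into a camera-frame frustum under a standard pinhole model, is given below; the intrinsics, depth bounds, and function name are placeholders rather than the paper's actual procedure.

    # Illustrative sketch (not the paper's method): lifting a 2D bounding box
    # into a 3D camera-frame frustum by back-projecting its corners at assumed
    # near/far depths. Intrinsics and depth bounds here are placeholders.
    import numpy as np

    def lift_box_to_frustum(box2d, z_near, z_far, fx, fy, cx, cy):
        """box2d = (u1, v1, u2, v2) in pixels; returns (8, 3) frustum corners."""
        u1, v1, u2, v2 = box2d
        corners = []
        for z in (z_near, z_far):
            for u, v in ((u1, v1), (u2, v1), (u2, v2), (u1, v2)):
                corners.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
        return np.asarray(corners)

    # Example: a 2D detection spanning pixels (300, 200)-(420, 320), searched
    # between 2 m and 20 m along the viewing ray.
    frustum = lift_box_to_frustum((300, 200, 420, 320), 2.0, 20.0,
                                  fx=720.0, fy=720.0, cx=640.0, cy=360.0)

How a tight 3D box is then estimated inside such a region is the part the paper addresses and is not shown here.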
- Objects as Spatio-Temporal 2.5D points [5.588892124219713]
We propose a weakly supervised method to estimate the 3D position of objects by jointly learning to regress 2D object detections and the scene's depth prediction in a single feed-forward pass of a network.
Our method extends a single-point-based object detector and introduces a novel object representation in which each object is modeled as a BEV point spatio-temporally, without the need for any 3D or BEV annotations during training or LiDAR data at query time.
arXiv Detail & Related papers (2022-12-06T05:14:30Z)
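The 2.5D-points entry above regresses a 2D object center together with a depth estimate; under a standard pinhole camera model those two quantities determine a 3D position. A minimal sketch, with illustrative intrinsics and names rather than the paper's code:

    # Illustrative sketch (not from the paper): recovering a 3D object position
    # from a regressed 2D center (u, v) and a predicted depth z via a standard
    # pinhole camera model. Intrinsics fx, fy, cx, cy are assumed known.
    import numpy as np

    def backproject_center(u, v, z, fx, fy, cx, cy):
        """Lift a 2D detection center and depth to a 3D point in camera coords."""
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.array([x, y, z])

    # Example: a detection centred at pixel (640, 360) predicted at 12.5 m depth.
    p3d = backproject_center(640, 360, 12.5, fx=720.0, fy=720.0, cx=640.0, cy=360.0)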
- AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z)
- ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals.
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z)
- Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning [62.18197846270103]
Current point-cloud detection methods have difficulty detecting open-vocabulary objects in the real world.
We propose OV-3DETIC, an Open-Vocabulary 3D DETector using Image-level Class supervision.
arXiv Detail & Related papers (2022-07-05T12:13:52Z)
- 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection [35.5386998382886]
3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description.
Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection and cross-modal matching.
We propose a 3D Single-Stage Referred Point Progressive Selection method, which progressively selects keypoints with the guidance of language and directly locates the target.
arXiv Detail & Related papers (2022-04-13T09:46:27Z)
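The 3D-SPS entry above selects keypoints progressively under language guidance. One simple way to realize such a step, scoring points by cosine similarity to a sentence feature and keeping the top-k, is sketched below; the names, shapes, and the plain cosine scoring are assumptions, not the paper's implementation.

    # Illustrative sketch (not the 3D-SPS code): language-guided keypoint
    # selection. Points whose features align best with the sentence feature
    # are kept, mimicking a "referred point progressive selection" step.
    import torch
    import torch.nn.functional as F

    def select_keypoints(point_feats, text_feat, keep):
        """point_feats: (N, D) per-point features; text_feat: (D,) sentence feature.

        Returns indices of the `keep` points most similar to the description.
        """
        scores = F.normalize(point_feats, dim=-1) @ F.normalize(text_feat, dim=0)
        return scores.topk(keep).indices

    # Progressive selection would repeat this with a shrinking budget,
    # e.g. 1024 -> 256 -> 64 keypoints.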
- SASA: Semantics-Augmented Set Abstraction for Point-based 3D Object Detection [78.90102636266276]
We propose a novel set abstraction method named Semantics-Augmented Set Abstraction (SASA).
Based on the estimated point-wise foreground scores, we then propose a semantics-guided point sampling algorithm to help retain more important foreground points during down-sampling.
In practice, SASA proves effective in identifying valuable points related to foreground objects and improving feature learning for point-based 3D detection.
arXiv Detail & Related papers (2022-01-06T08:54:47Z)
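The SASA entry above keeps more foreground points during down-sampling by using estimated per-point foreground scores. A minimal stand-in, sampling indices with probability proportional to those scores, is sketched below; SASA's actual combination of semantics with distance-based sampling is not reproduced, and all names here are illustrative.

    # Illustrative sketch (not the SASA implementation): down-sampling a point
    # cloud with probabilities weighted by estimated per-point foreground
    # scores, so likely-foreground points are preferentially retained.
    import torch

    def semantics_guided_sampling(points, fg_scores, num_samples):
        """points: (N, 3); fg_scores: (N,) in [0, 1]; returns (num_samples, 3)."""
        probs = fg_scores.clamp(min=1e-6)
        probs = probs / probs.sum()
        idx = torch.multinomial(probs, num_samples, replacement=False)
        return points[idx]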
- Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)