CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for
Open-vocabulary 3D Object Detection
- URL: http://arxiv.org/abs/2310.02960v1
- Date: Wed, 4 Oct 2023 16:50:51 GMT
- Title: CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for
Open-vocabulary 3D Object Detection
- Authors: Yang Cao, Yihan Zeng, Hang Xu, Dan Xu
- Abstract summary: Open-vocabulary 3D Object Detection (OV-3DDet) aims to detect objects from an arbitrary list of categories within a 3D scene, which remains seldom explored in the literature.
This paper addresses the two fundamental problems of OV-3DDet, localizing and classifying novel objects, simultaneously via a unified framework, under the condition of limited base categories.
To localize novel 3D objects, we propose an effective 3D Novel Object Discovery strategy, which utilizes both the 3D box geometry priors and 2D semantic open-vocabulary priors to generate pseudo box labels of the novel objects.
- Score: 38.144357345583664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary 3D Object Detection (OV-3DDet) aims to detect objects from an
arbitrary list of categories within a 3D scene, which remains seldom explored
in the literature. There are primarily two fundamental problems in OV-3DDet,
i.e., localizing and classifying novel objects. This paper aims at addressing
the two problems simultaneously via a unified framework, under the condition of
limited base categories. To localize novel 3D objects, we propose an effective
3D Novel Object Discovery strategy, which utilizes both the 3D box geometry
priors and 2D semantic open-vocabulary priors to generate pseudo box labels of
the novel objects. To classify novel object boxes, we further develop a
cross-modal alignment module based on discovered novel boxes, to align feature
spaces between 3D point cloud and image/text modalities. Specifically, the
alignment process contains a class-agnostic and a class-discriminative
alignment, incorporating not only the base objects with annotations but also
the increasingly discovered novel objects, resulting in an iteratively enhanced
alignment. The novel box discovery and cross-modal alignment are jointly learned
to collaboratively benefit each other. The novel object discovery can directly
impact the cross-modal alignment, while a better feature alignment can, in
turn, boost the localization capability, leading to a unified OV-3DDet
framework, named CoDA, for simultaneous novel object localization and
classification. Extensive experiments on two challenging datasets (i.e.,
SUN-RGBD and ScanNet) demonstrate the effectiveness of our method and also show
a significant mAP improvement upon the best-performing alternative method by
80%. Code and pre-trained models are released on the project page.
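The novel box discovery step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the threshold values, the axis-aligned box layout, and the function names `iou_3d` and `discover_novel_boxes` are all hypothetical. It assumes each candidate box already carries a class-agnostic 3D objectness score (standing in for the geometry prior) and a 2D open-vocabulary score (standing in for the semantic prior), and keeps a box as a pseudo label only if both scores are high and the box does not overlap an already known box.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    box: tuple         # (cx, cy, cz, dx, dy, dz), axis-aligned; hypothetical layout
    objectness: float  # assumed 3D geometry prior score in [0, 1]
    semantic: float    # assumed 2D open-vocabulary prior score in [0, 1]


def iou_3d(a, b):
    """IoU of two axis-aligned boxes given as center + size."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def discover_novel_boxes(candidates, known_boxes,
                         obj_thr=0.75, sem_thr=0.5, iou_thr=0.25):
    """Return pseudo box labels; all thresholds are illustrative assumptions."""
    pool = []
    for c in candidates:
        # Both priors must agree that this is a plausible object.
        if c.objectness < obj_thr or c.semantic < sem_thr:
            continue
        # Skip boxes that overlap annotated or previously discovered boxes.
        if any(iou_3d(c.box, k) > iou_thr for k in known_boxes + pool):
            continue
        pool.append(c.box)
    return pool


base_boxes = [(0, 0, 0, 1, 1, 1)]
candidates = [
    Candidate((0.1, 0, 0, 1, 1, 1), 0.9, 0.9),  # overlaps a base box
    Candidate((5, 5, 5, 1, 1, 1), 0.9, 0.9),    # passes both priors
    Candidate((9, 9, 9, 1, 1, 1), 0.9, 0.1),    # weak semantic prior
    Candidate((3, 3, 3, 1, 1, 1), 0.2, 0.9),    # weak geometry prior
]
novel = discover_novel_boxes(candidates, base_boxes)
```

In a training loop of the kind the abstract describes, the returned pool would be merged into the label set so that later iterations see more novel boxes; how the scores are produced and how the pool feeds the alignment module is left out of this sketch.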
Related papers
- Open Vocabulary Monocular 3D Object Detection [10.424711580213616]
We pioneer the study of open-vocabulary monocular 3D object detection, a novel task that aims to detect and localize objects in 3D space from a single RGB image.
We introduce a class-agnostic approach that leverages open-vocabulary 2D detectors and lifts 2D bounding boxes into 3D space.
Our approach decouples the recognition and localization of objects in 2D from the task of estimating 3D bounding boxes, enabling generalization across unseen categories.
arXiv Detail & Related papers (2024-11-25T18:59:17Z)
- Syn-to-Real Unsupervised Domain Adaptation for Indoor 3D Object Detection [50.448520056844885]
We propose a novel framework for syn-to-real unsupervised domain adaptation in indoor 3D object detection.
Our adaptation results from synthetic dataset 3D-FRONT to real-world datasets ScanNetV2 and SUN RGB-D demonstrate remarkable mAP25 improvements of 9.7% and 9.1% over Source-Only baselines.
arXiv Detail & Related papers (2024-06-17T08:18:41Z)
- Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection [34.91703960513125]
CoDAv2 is a unified framework designed to tackle both the localization and classification of novel 3D objects.
CoDAv2 outperforms the best-performing method by a large margin.
Source code and pre-trained models are available at the GitHub project page.
arXiv Detail & Related papers (2024-06-02T18:32:37Z)
- Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments [67.83787474506073]
We tackle the limitations of current LiDAR-based 3D object detection systems.
We introduce a universal Find n' Propagate approach for 3D OV tasks.
We achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes.
arXiv Detail & Related papers (2024-03-20T12:51:30Z)
- Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance [49.14140194332482]
We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes.
Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task.
arXiv Detail & Related papers (2023-12-17T10:07:03Z)
- OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection [41.24059083441953]
OpenSight is a more advanced 2D-3D modeling framework for LiDAR-based open-vocabulary detection.
Our method establishes state-of-the-art open-vocabulary performance on widely used 3D detection benchmarks.
arXiv Detail & Related papers (2023-12-12T07:49:30Z)
- Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling [38.07637524378327]
Unsupervised domain adaptation (DA) with the aid of pseudo labeling techniques has emerged as a crucial approach for domain-adaptive 3D object detection.
Existing DA methods suffer from a substantial drop in performance when applied to a multi-class training setting.
We propose a novel ReDB framework tailored for learning to detect all classes at once.
arXiv Detail & Related papers (2023-07-16T04:34:11Z)
- NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization [80.3424839706698]
We present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering.
Our approach rests on insights in learning a category-level shape prior directly from real driving scenes.
We make critical design choices to learn object coordinates more effectively from an object-centric view.
arXiv Detail & Related papers (2023-05-28T16:18:41Z)
- Open-Vocabulary Point-Cloud Object Detection without 3D Annotation [62.18197846270103]
The goal of open-vocabulary 3D point-cloud detection is to identify novel objects based on arbitrary textual descriptions.
We develop a point-cloud detector that can learn a general representation for localizing various objects.
We also propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text.
arXiv Detail & Related papers (2023-04-03T08:22:02Z)