Object2Scene: Putting Objects in Context for Open-Vocabulary 3D
Detection
- URL: http://arxiv.org/abs/2309.09456v1
- Date: Mon, 18 Sep 2023 03:31:53 GMT
- Title: Object2Scene: Putting Objects in Context for Open-Vocabulary 3D
Detection
- Authors: Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu and Kai Chen
- Abstract summary: Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set.
Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics.
We propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection.
- Score: 24.871590175483096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Point cloud-based open-vocabulary 3D object detection aims to detect 3D
categories that do not have ground-truth annotations in the training set. It is
extremely challenging because of the limited data and annotations (bounding
boxes with class labels or text descriptions) of 3D scenes. Previous approaches
leverage large-scale richly-annotated image datasets as a bridge between 3D and
category semantics but require an extra alignment process between 2D images and
3D points, limiting the open-vocabulary ability of 3D detectors. Instead of
leveraging 2D images, we propose Object2Scene, the first approach that
leverages large-scale large-vocabulary 3D object datasets to augment existing
3D scene datasets for open-vocabulary 3D object detection. Object2Scene inserts
objects from different sources into 3D scenes to enrich the vocabulary of 3D
scene datasets and generates text descriptions for the newly inserted objects.
We further introduce a framework that unifies 3D detection and visual
grounding, named L3Det, and propose a cross-domain category-level contrastive
learning approach to mitigate the domain gap between 3D objects from different
datasets. Extensive experiments on existing open-vocabulary 3D object detection
benchmarks show that Object2Scene obtains superior performance over existing
methods. We further verify the effectiveness of Object2Scene on a new benchmark
OV-ScanNet-200, by holding out all rare categories as novel categories not seen
during training.
Related papers
- Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling [9.440800948514449]
We propose a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling.
Our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images.
We design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes.
arXiv Detail & Related papers (2024-04-03T07:30:09Z) - InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes [86.26588382747184]
We introduce InseRF, a novel method for generative object insertion in the NeRF reconstructions of 3D scenes.
Based on a user-provided textual description and a 2D bounding box in a reference viewpoint, InseRF generates new objects in 3D scenes.
arXiv Detail & Related papers (2024-01-10T18:59:53Z) - Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models on aligning the semantics between texts and 2D images.
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z) - Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z) - Lowis3D: Language-Driven Open-World Instance-Level 3D Scene
Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object
Detection [17.526914782562528]
We propose Graph-DETR3D to automatically aggregate multi-view imagery information through graph structure learning (GSL)
Our best model achieves 49.5 NDS on the nuScenes test leaderboard, achieving new state-of-the-art in comparison with various published image-view 3D object detectors.
arXiv Detail & Related papers (2022-04-25T12:10:34Z) - HyperDet3D: Learning a Scene-conditioned 3D Object Detector [154.84798451437032]
We propose HyperDet3D to explore scene-conditioned prior knowledge for 3D object detection.
Our HyperDet3D achieves state-of-the-art results on the 3D object detection benchmark of the ScanNet and SUN RGB-D datasets.
arXiv Detail & Related papers (2022-04-12T07:57:58Z) - Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z) - Weakly Supervised 3D Object Detection from Point Clouds [27.70180601788613]
3D object detection aims to detect and localize the 3D bounding boxes of objects belonging to specific classes.
Existing 3D object detectors rely on annotated 3D bounding boxes during training, while these annotations could be expensive to obtain and only accessible in limited scenarios.
We propose VS3D, a framework for weakly supervised 3D object detection from point clouds without using any ground truth 3D bounding box for training.
arXiv Detail & Related papers (2020-07-28T03:30:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.