ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved
Visio-Linguistic Models in 3D Scenes
- URL: http://arxiv.org/abs/2212.06250v2
- Date: Sat, 1 Apr 2023 12:13:27 GMT
- Title: ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved
Visio-Linguistic Models in 3D Scenes
- Authors: Ahmed Abdelreheem, Kyle Olszewski, Hsin-Ying Lee, Peter Wonka, Panos
Achlioptas
- Abstract summary: Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k natural referential sentences.
We show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures.
- Score: 48.65360357173095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural
language to real-world 3D data. In this paper, we curate a large-scale and
complementary dataset extending both the aforementioned ones by associating all
objects mentioned in a referential sentence to their underlying instances
inside a 3D scene. Specifically, our Scan Entities in 3D (ScanEnts3D) dataset
provides explicit correspondences between 369k objects across 84k natural
referential sentences, covering 705 real-world scenes. Crucially, we show that
by incorporating intuitive losses that enable learning from this novel dataset,
we can significantly improve the performance of several recently introduced
neural listening architectures, including improving the SoTA in both the Nr3D
and ScanRefer benchmarks by 4.3% and 5.0%, respectively. Moreover, we
experiment with competitive baselines and recent methods for the task of
language generation and show that, as with neural listeners, 3D neural speakers
can also noticeably benefit by training with ScanEnts3D, including improving
the SoTA by 13.2 CIDEr points on the Nr3D benchmark. Overall, our carefully
conducted experimental studies strongly support the conclusion that, by
learning on ScanEnts3D, commonly used visio-linguistic 3D architectures can
become more efficient and interpretable in their generalization without needing
to provide these newly collected annotations at test time. The project's
webpage is https://scanents3d.github.io/ .
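The "intuitive losses" mentioned in the abstract are not spelled out in this summary, but the core idea, supervising which 3D instance each noun phrase of a referential sentence corresponds to, lends itself to a simple auxiliary objective. Below is a minimal, hypothetical sketch of one plausible form: the tensor names, shapes, and the dot-product-plus-cross-entropy formulation are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical auxiliary loss exploiting ScanEnts3D-style phrase-to-object
# links; names, shapes, and the similarity function are assumptions.
import torch
import torch.nn.functional as F

def phrase_to_object_loss(object_feats: torch.Tensor,
                          phrase_feats: torch.Tensor,
                          gt_object_ids: torch.Tensor) -> torch.Tensor:
    """
    object_feats : (num_objects, d)  embeddings of the scene's 3D instances
    phrase_feats : (num_phrases, d)  embeddings of noun phrases in the sentence
    gt_object_ids: (num_phrases,)    annotated instance index for each phrase
    """
    # Score every mentioned phrase against every object in the scene.
    logits = phrase_feats @ object_feats.t()      # (num_phrases, num_objects)
    # Cross-entropy pulls each phrase toward its annotated 3D instance.
    return F.cross_entropy(logits, gt_object_ids)

# Toy usage: 5 objects in the scene, 2 mentioned phrases, 32-d embeddings.
if __name__ == "__main__":
    obj = torch.randn(5, 32)
    phr = torch.randn(2, 32)
    ids = torch.tensor([3, 0])
    print(phrase_to_object_loss(obj, phr, ids))
```

Because a loss of this kind only shapes the shared representation during training, no ScanEnts3D annotations would be needed at inference, which is consistent with the abstract's claim that the annotations are not required at test time.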
Related papers
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity when translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans [6.936271803454143]
We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG).
We created RIORefer, a large-scale 3D visual grounding dataset.
It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan.
arXiv Detail & Related papers (2023-05-23T09:52:49Z)
- OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation [107.71752592196138]
We propose OmniObject3D, a large-vocabulary 3D object dataset with massive high-quality real-scanned 3D objects.
It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets.
Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos.
arXiv Detail & Related papers (2023-01-18T18:14:18Z)
- Prompt-guided Scene Generation for 3D Zero-Shot Learning [8.658191774247944]
We propose a prompt-guided 3D scene generation and supervision method that augments 3D data to train the network better.
First, we merge the point clouds of two 3D models in ways described by a prompt; the prompt acts as the annotation describing each generated 3D scene (a minimal sketch follows this entry).
We have achieved state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanObjectNN) 3D object datasets.
arXiv Detail & Related papers (2022-09-29T11:24:33Z)
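The merge-and-label step above is simple enough to sketch. The following is a minimal, hypothetical illustration: the fixed-offset placement and the prompt template are assumptions made for this sketch, not the paper's exact merging rule.

```python
# Hypothetical illustration of merging two point clouds under a prompt-style
# label; the offset placement and prompt wording are assumptions.
import numpy as np

def merge_with_prompt(points_a, points_b, label_a, label_b,
                      offset=(1.0, 0.0, 0.0)):
    """points_a, points_b: (N, 3) point clouds; returns (scene, prompt)."""
    shifted_b = points_b + np.asarray(offset)   # place model B beside model A
    scene = np.concatenate([points_a, shifted_b], axis=0)
    prompt = f"a {label_a} next to a {label_b}"  # the prompt annotates the scene
    return scene, prompt

# Toy usage with random point clouds standing in for two 3D models.
if __name__ == "__main__":
    chair = np.random.rand(1024, 3)
    table = np.random.rand(1024, 3)
    scene, prompt = merge_with_prompt(chair, table, "chair", "table")
    print(scene.shape, "-", prompt)   # (2048, 3) - a chair next to a table
```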
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of synthetic datasets, which consist of CAD object models, to boost learning on real datasets.
Recent work on 3D pre-training fails when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.