SAT: 2D Semantics Assisted Training for 3D Visual Grounding
- URL: http://arxiv.org/abs/2105.11450v1
- Date: Mon, 24 May 2021 17:58:36 GMT
- Title: SAT: 2D Semantics Assisted Training for 3D Visual Grounding
- Authors: Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo
- Abstract summary: 3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region.
Point clouds are sparse, noisy, and contain limited semantic information compared with 2D images.
We propose 2D Semantics Assisted Training (SAT) that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning.
- Score: 95.84637054325039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding aims at grounding a natural language description about a
3D scene, usually represented in the form of 3D point clouds, to the targeted
object region. Point clouds are sparse, noisy, and contain limited semantic
information compared with 2D images. These inherent limitations make the 3D
visual grounding problem more challenging. In this study, we propose 2D
Semantics Assisted Training (SAT) that utilizes 2D image semantics in the
training stage to ease point-cloud-language joint representation learning and
assist 3D visual grounding. The main idea is to learn auxiliary alignments
between rich, clean 2D object representations and the corresponding objects or
mentioned entities in 3D scenes. SAT takes 2D object semantics, i.e., object
label, image feature, and 2D geometric feature, as the extra input in training
but does not require such inputs during inference. By effectively utilizing 2D
semantics in training, our approach boosts the accuracy on the Nr3D dataset
from 37.7% to 49.2%, which significantly surpasses the non-SAT baseline with
the identical network architecture and inference input. Our approach
outperforms the state of the art by large margins on multiple 3D visual
grounding datasets, i.e., +10.4% absolute accuracy on Nr3D, +9.9% on Sr3D, and
+5.6% on ScanRef.
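The training-only use of 2D semantics described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the form of the alignment objective, so the InfoNCE-style contrastive loss, the temperature, the loss weight, and every function name below (`cosine`, `alignment_loss`, `training_loss`, `predict`) are assumptions made for illustration. It shows only the overall shape of the idea: an auxiliary term pulls each 3D object feature toward its matching 2D feature during training, and inference needs no 2D input.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def alignment_loss(feats_3d, feats_2d, temperature=0.1):
    """Assumed InfoNCE-style loss: 3D object i should match 2D object i."""
    total = 0.0
    for i, f3 in enumerate(feats_3d):
        logits = [cosine(f3, f2) / temperature for f2 in feats_2d]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(feats_3d)


def training_loss(grounding_loss, feats_3d, feats_2d, weight=1.0):
    """Training objective: grounding loss plus the auxiliary 2D-3D term."""
    return grounding_loss + weight * alignment_loss(feats_3d, feats_2d)


def predict(scores_3d):
    """Inference: pick the best-scoring 3D object; no 2D input is used."""
    return max(range(len(scores_3d)), key=scores_3d.__getitem__)


if __name__ == "__main__":
    feats_3d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    matched_2d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled_2d = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print("aligned loss:   ", alignment_loss(feats_3d, matched_2d))
    print("misaligned loss:", alignment_loss(feats_3d, shuffled_2d))
    print("predicted index:", predict([0.1, 0.9, 0.3]))
```

In this toy setup the loss is near zero when each 3D feature points in the same direction as its paired 2D feature and grows when the pairing is shuffled. `predict` takes only per-object scores from the 3D and language branch, mirroring the statement that 2D inputs are not required at inference.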
Related papers
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
Existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries.
We propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the strong ability of current large-scale vision-language models (VLMs) to align the semantics between texts and 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn the transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art semantic scene completion performance on two large-scale benchmark datasets, Matterport3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding [23.672405624011873]
We propose a module to consolidate the 3D visual stream by 2D clues synthesized from point clouds.
We empirically show their aptitude to boost the quality of the learned visual representations.
Our proposed module, dubbed as Look Around and Refer (LAR), significantly outperforms the state-of-the-art 3D visual grounding techniques on three benchmarks.
arXiv Detail & Related papers (2022-11-25T17:12:08Z)
- Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that the 3D models pretrained with 2D knowledge boost the performances across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- Bidirectional Projection Network for Cross Dimension Scene Understanding [69.29443390126805]
We present a bidirectional projection network (BPNet) for joint 2D and 3D reasoning in an end-to-end manner.
Via the bidirectional projection module (BPM), complementary 2D and 3D information can interact with each other at multiple architectural levels.
Our BPNet achieves top performance on the ScanNetV2 benchmark for both 2D and 3D semantic segmentation.
arXiv Detail & Related papers (2021-03-26T08:31:39Z)
- Semantic Correspondence via 2D-3D-2D Cycle [58.023058561837686]
We propose a new method for predicting semantic correspondences by lifting the problem to the 3D domain.
We show that our method gives comparable and even superior results on standard semantic benchmarks.
arXiv Detail & Related papers (2020-04-20T05:27:45Z)