Related papers: SAT: 2D Semantics Assisted Training for 3D Visual Grounding

SAT: 2D Semantics Assisted Training for 3D Visual Grounding

URL: http://arxiv.org/abs/2105.11450v1
Date: Mon, 24 May 2021 17:58:36 GMT
Title: SAT: 2D Semantics Assisted Training for 3D Visual Grounding
Authors: Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo
Abstract summary: 3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region. Point clouds are sparse, noisy, and contain limited semantic information compared with 2D images. We propose 2D Semantics Assisted Training (SAT) that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning.
Score: 95.84637054325039
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region. Point clouds are sparse, noisy, and contain limited semantic information compared with 2D images. These inherent limitations make the 3D visual grounding problem more challenging. In this study, we propose 2D Semantics Assisted Training (SAT) that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding. The main idea is to learn auxiliary alignments between rich, clean 2D object representations and the corresponding objects or mentioned entities in 3D scenes. SAT takes 2D object semantics, i.e., object label, image feature, and 2D geometric feature, as the extra input in training but does not require such inputs during inference. By effectively utilizing 2D semantics in training, our approach boosts the accuracy on the Nr3D dataset from 37.7% to 49.2%, which significantly surpasses the non-SAT baseline with the identical network architecture and inference input. Our approach outperforms the state of the art by large margins on multiple 3D visual grounding datasets, i.e., +10.4% absolute accuracy on Nr3D, +9.9% on Sr3D, and +5.6% on ScanRef.

Related papers

Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning. UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
MPL: Lifting 3D Human Pose from Multi-view 2D Poses [75.26416079541723]
We propose combining 2D pose estimation, for which large and rich training datasets exist, and 2D-to-3D pose lifting, using a transformer-based network. Our experiments demonstrate decreases up to 45% in MPJPE errors compared to the 3D pose obtained by triangulating the 2D poses.
arXiv Detail & Related papers (2024-08-20T12:55:14Z)
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models on aligning the semantics between texts and 2D images. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$2$) to learn the transferable 3D point cloud representation in realistic scenarios. Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations. The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images. Our method achieves the state-of-the-art performance of semantic scene completion on two large-scale benchmark datasets MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding [23.672405624011873]
We propose a module to consolidate the 3D visual stream by 2D clues synthesized from point clouds. We empirically show their aptitude to boost the quality of the learned visual representations. Our proposed module, dubbed as Look Around and Refer (LAR), significantly outperforms the state-of-the-art 3D visual grounding techniques on three benchmarks.
arXiv Detail & Related papers (2022-11-25T17:12:08Z)
Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets. Our experiments show that the 3D models pretrained with 2D knowledge boost the performances across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z)
3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images. First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training. Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration. Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
Semantic Correspondence via 2D-3D-2D Cycle [58.023058561837686]
We propose a new method on predicting semantic correspondences by leveraging it to 3D domain. We show that our method gives comparative and even superior results on standard semantic benchmarks.
arXiv Detail & Related papers (2020-04-20T05:27:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.