VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic
Scene Graph Prediction in Point Cloud
- URL: http://arxiv.org/abs/2303.14408v1
- Date: Sat, 25 Mar 2023 09:14:18 GMT
- Title: VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic
Scene Graph Prediction in Point Cloud
- Authors: Ziqin Wang, Bowen Cheng, Lichen Zhao, Dong Xu, Yang Tang, Lu Sheng
- Abstract summary: 3D semantic scene graph (3DSSG) prediction in the point cloud is challenging since the 3D point cloud only captures geometric structures with limited semantics compared to 2D images.
We propose a Visual-Linguistic Semantics Assisted Training scheme that significantly strengthens the ability of 3DSSG prediction models to discriminate long-tailed and ambiguous semantic relations.
- Score: 51.063494002003154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of 3D semantic scene graph (3DSSG) prediction in the point cloud is
challenging since (1) the 3D point cloud only captures geometric structures
with limited semantics compared to 2D images, and (2) long-tailed relation
distribution inherently hinders the learning of unbiased prediction. Since 2D
images provide rich semantics and scene graphs are naturally coupled with
language, in this study we propose the Visual-Linguistic Semantics Assisted
Training (VL-SAT) scheme, which significantly strengthens the ability of 3DSSG
prediction models to discriminate long-tailed and ambiguous semantic
relations. The key
idea is to train a powerful multi-modal oracle model to assist the 3D model.
This oracle learns reliable structural representations based on semantics from
vision, language, and 3D geometry, and its benefits can be transferred to the
3D model during the training stage. By effectively utilizing visual-linguistic
semantics during training, VL-SAT significantly boosts common 3DSSG prediction
models, such as SGFN and SGGpoint, while requiring only 3D inputs at inference,
especially when dealing with tail relation triplets.
Comprehensive evaluations and ablation studies on the 3DSSG dataset have
validated the effectiveness of the proposed scheme. Code is available at
https://github.com/wz7in/CVPR2023-VLSAT.
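The training recipe described above can be read as a teacher-student setup: a multi-modal oracle that sees visual, linguistic, and 3D features is trained jointly with a 3D-only predictor, and the oracle's knowledge reaches the 3D branch through an auxiliary loss, so that inference needs nothing but the point cloud. The sketch below illustrates that idea in PyTorch; the module names, feature dimensions, loss form, and the 0.5 distillation weight are illustrative assumptions, not the released VL-SAT implementation (see the repository above for the actual code).

```python
# Minimal sketch of oracle-assisted training for 3D scene graph prediction.
# All module names, dimensions, batch keys, and loss weights are assumptions
# made for illustration; they are not the VL-SAT code base.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Relation3DModel(nn.Module):
    """3D-only predictor: consumes point-cloud edge features, predicts relations."""
    def __init__(self, dim_3d=256, num_relations=26):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim_3d, 256), nn.ReLU(),
                                  nn.Linear(256, num_relations))

    def forward(self, edge_feat_3d):
        return self.head(edge_feat_3d)

class MultiModalOracle(nn.Module):
    """Oracle: fuses 3D, visual, and linguistic edge features (training only)."""
    def __init__(self, dim_3d=256, dim_vis=512, dim_lang=512, num_relations=26):
        super().__init__()
        self.fuse = nn.Linear(dim_3d + dim_vis + dim_lang, 256)
        self.head = nn.Linear(256, num_relations)

    def forward(self, edge_feat_3d, edge_feat_vis, edge_feat_lang):
        fused = F.relu(self.fuse(torch.cat(
            [edge_feat_3d, edge_feat_vis, edge_feat_lang], dim=-1)))
        return self.head(fused)

def training_step(student, oracle, batch, optimizer, distill_weight=0.5):
    """One step: both models see the relation labels; the student is also
    pulled toward the oracle's (semantically richer) relation distribution.
    The optimizer is assumed to cover both models' parameters."""
    logits_3d = student(batch["edge_feat_3d"])
    logits_oracle = oracle(batch["edge_feat_3d"],
                           batch["edge_feat_vis"],
                           batch["edge_feat_lang"])

    loss_3d = F.cross_entropy(logits_3d, batch["relation_labels"])
    loss_oracle = F.cross_entropy(logits_oracle, batch["relation_labels"])
    # Distill the oracle's soft predictions into the 3D-only student.
    loss_distill = F.kl_div(F.log_softmax(logits_3d, dim=-1),
                            F.softmax(logits_oracle.detach(), dim=-1),
                            reduction="batchmean")
    loss = loss_3d + loss_oracle + distill_weight * loss_distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference only the 3D-only predictor would be kept, which is consistent with the abstract's claim that the boosted SGFN and SGGpoint models need only 3D inputs at test time.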
Related papers
- Bayesian Self-Training for Semi-Supervised 3D Segmentation [59.544558398992386] (2024-09-12)
  3D segmentation is a core problem in computer vision.
  Densely labeling 3D point clouds for fully-supervised training remains too labor-intensive and expensive.
  Semi-supervised training provides a more practical alternative, where only a small labeled set is given, accompanied by a larger unlabeled set.
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885] (2024-06-17)
  We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
  A series of experiments robustly demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
- Large Generative Model Assisted 3D Semantic Communication [51.17527319441436] (2024-03-09)
  We propose a Generative AI Model assisted 3D Semantic Communication (GAM-3DSC) system.
  First, we introduce a 3D Semantic Extractor (3DSE) to extract key semantics from a 3D scenario based on user requirements.
  We then present an Adaptive Semantic Compression Model (ASCM) for encoding these multi-perspective images.
  Finally, we design a conditional Generative adversarial network and Diffusion model aided Channel Estimation (GDCE) scheme to estimate and refine the Channel State Information (CSI) of physical channels.
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095] (2024-01-17)
  We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images.
  The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
  The output is a dense voxel map of 3D-grounded language embeddings, enabling a range of open-vocabulary tasks.
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116] (2023-11-03)
  We present a novel framework that adapts various foundation models for the 3D point cloud segmentation task.
  Our approach involves making initial predictions of 2D semantic masks using different large vision models.
  To generate robust 3D semantic pseudo-labels, we introduce a semantic label fusion strategy that combines all the results via voting (a minimal voting sketch appears after this list).
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697] (2023-03-22)
  We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
  Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
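The label-efficient segmentation entry above only states that per-model 2D semantic masks are fused into 3D pseudo-labels via voting. Under that reading, a minimal per-point majority vote might look like the following; the array shapes, the ignore-label convention, and the function name are assumptions for illustration, not that paper's actual pipeline.

```python
# Hypothetical majority-vote fusion of per-model semantic predictions for 3D points.
# Shapes and the ignore_label convention are illustrative assumptions.
import numpy as np

def fuse_pseudo_labels(predictions, num_classes, ignore_label=-1):
    """predictions: (num_models, num_points) int array of per-point class votes,
    where ignore_label marks points a model did not label (e.g., not visible
    in any of its 2D views). Returns a (num_points,) array of fused labels."""
    num_models, num_points = predictions.shape
    votes = np.zeros((num_points, num_classes), dtype=np.int64)
    for m in range(num_models):
        valid = predictions[m] != ignore_label
        votes[np.nonzero(valid)[0], predictions[m, valid]] += 1
    fused = votes.argmax(axis=1)
    # Points that received no vote from any model stay unlabeled.
    fused[votes.sum(axis=1) == 0] = ignore_label
    return fused

# Example: three models vote on four points.
preds = np.array([[0, 1, 2, -1],
                  [0, 1, 1, -1],
                  [3, 1, 2, -1]])
print(fuse_pseudo_labels(preds, num_classes=4))  # -> [ 0  1  2 -1]
```

Ties are resolved here by argmax's preference for the lowest class index; a real fusion strategy may instead weight models by confidence or reliability.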