Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph
prediction
- URL: http://arxiv.org/abs/2310.16494v1
- Date: Wed, 25 Oct 2023 09:26:16 GMT
- Title: Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph
prediction
- Authors: Sebastian Koch, Pedro Hermosilla, Narunas Vaskevicius, Mirco Colosi,
Timo Ropinski
- Abstract summary: We present the first language-based pre-training approach for 3D scene graphs.
We leverage the language encoder of CLIP, a popular vision-language model, to distill its knowledge into our graph-based network.
Our method achieves state-of-the-art results on the main semantic 3D scene graph benchmark.
- Score: 16.643252717745348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D scene graphs are an emerging 3D scene representation that models both the
objects present in the scene as well as their relationships. However, learning
3D scene graphs is a challenging task because it requires not only object
labels but also relationship annotations, which are very scarce in datasets.
While it is widely accepted that pre-training is an effective approach to
improve model performance in low data regimes, in this paper, we find that
existing pre-training methods are ill-suited for 3D scene graphs. To solve this
issue, we present the first language-based pre-training approach for 3D scene
graphs, whereby we exploit the strong relationship between scene graphs and
language. To this end, we leverage the language encoder of CLIP, a popular
vision-language model, to distill its knowledge into our graph-based network.
We formulate a contrastive pre-training, which aligns text embeddings of
relationships (subject-predicate-object triplets) and predicted 3D graph
features. Our method achieves state-of-the-art results on the main semantic 3D
scene graph benchmark by showing improved effectiveness over pre-training
baselines and outperforming all the existing fully supervised scene graph
prediction methods by a significant margin. Furthermore, since our scene graph
features are language-aligned, we can query the language space of the features
in a zero-shot manner. In this paper, we show an example of utilizing this
property to predict the room type of a scene without further training.
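The contrastive objective described in the abstract can be sketched as a CLIP-style symmetric InfoNCE loss. This is a minimal illustration, not the paper's implementation: random vectors stand in for the CLIP text embeddings of (subject, predicate, object) triplets and for the predicted 3D graph features, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def symmetric_infonce(text_emb, graph_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE over a batch of matched pairs.

    text_emb, graph_emb: (B, D) arrays where row i of each forms a positive pair.
    """
    logits = l2_normalize(text_emb) @ l2_normalize(graph_emb).T / temperature

    def cross_entropy_diag(lg):
        # log-softmax per row; the matching pair sits on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average the text->graph and graph->text directions, as in CLIP
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
B, D = 8, 64
triplet_text = rng.normal(size=(B, D))                      # stand-in: CLIP text embeddings of triplets
graph_feats = triplet_text + 0.1 * rng.normal(size=(B, D))  # stand-in: well-aligned graph features

loss_aligned = symmetric_infonce(triplet_text, graph_feats)
loss_random = symmetric_infonce(triplet_text, rng.normal(size=(B, D)))
```

Minimizing this loss pulls each graph feature toward the text embedding of its own triplet and away from the other triplets in the batch. In the same spirit, the zero-shot room-type query mentioned above would amount to comparing an aggregated scene feature against text embeddings of candidate room descriptions and taking the most similar one.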
Related papers
- ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding [2.5165775267615205]
This work is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding.
Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence.
arXiv Detail & Related papers (2024-06-30T06:58:04Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- SGRec3D: Self-Supervised 3D Scene Graph Learning via Object-Level Scene Reconstruction [16.643252717745348]
We present SGRec3D, a novel self-supervised pre-training method for 3D scene graph prediction.
Pre-training SGRec3D does not require object relationship labels, making it possible to exploit large-scale 3D scene understanding datasets.
Our experiments demonstrate that in contrast to recent point cloud-based pre-training approaches, our proposed pre-training improves the 3D scene graph prediction considerably.
arXiv Detail & Related papers (2023-09-27T14:45:29Z)
- Incremental 3D Semantic Scene Graph Prediction from RGB Sequences [86.77318031029404]
We propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence.
Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network.
The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities.
arXiv Detail & Related papers (2023-05-04T11:32:16Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph.
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z)
- SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences [76.28527350263012]
We propose a method to incrementally build up semantic scene graphs from a 3D environment given a sequence of RGB-D frames.
We aggregate PointNet features from primitive scene components by means of a graph neural network.
Our approach outperforms 3D scene graph prediction methods by a large margin and its accuracy is on par with other 3D semantic and panoptic segmentation methods while running at 35 Hz.
arXiv Detail & Related papers (2021-03-27T13:00:36Z)
- Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions [94.17683799712397]
We focus on scene graphs, a data structure that organizes the entities of a scene in a graph.
We propose a learned method that regresses a scene graph from the point cloud of a scene.
We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
arXiv Detail & Related papers (2020-04-08T12:25:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.