Estimating Commonsense Scene Composition on Belief Scene Graphs
- URL: http://arxiv.org/abs/2505.02405v1
- Date: Mon, 05 May 2025 06:55:59 GMT
- Title: Estimating Commonsense Scene Composition on Belief Scene Graphs
- Authors: Mario A. V. Saucedo, Vignesh Kottayam Viswanathan, Christoforos Kanellakis, George Nikolakopoulos
- Abstract summary: This work establishes the concept of commonsense scene composition, with a focus on extending Belief Scene Graphs. Specifically, the commonsense scene composition capability refers to the understanding of the spatial relationships among related objects in the scene. The framework has been validated through multiple runs on simulated data, as well as in a real-world indoor environment.
- Score: 5.62479170374811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work establishes the concept of commonsense scene composition, with a focus on extending Belief Scene Graphs by estimating the spatial distribution of unseen objects. Specifically, the commonsense scene composition capability refers to the understanding of the spatial relationships among related objects in the scene, which in this article is modeled as a joint probability distribution for all possible locations of the semantic object class. The proposed framework includes two variants of a Correlation Information (CECI) model for learning probability distributions: (i) a baseline approach based on a Graph Convolutional Network, and (ii) a neuro-symbolic extension that integrates a spatial ontology based on Large Language Models (LLMs). Furthermore, this article provides a detailed description of the dataset generation process for such tasks. Finally, the framework has been validated through multiple runs on simulated data, as well as in a real-world indoor environment, demonstrating its ability to spatially interpret scenes across different room types.
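To make the abstract's description more concrete, the sketch below shows one way a GCN-based model could map a belief scene graph to a probability distribution over candidate locations for an unseen object class. It is a minimal illustration of the general idea only, not the authors' CECI implementation: the class and variable names, feature dimensions, graph encoding, and adjacency normalization are all assumptions.

```python
# Minimal sketch (not the paper's code): a GCN-style model that, given a graph of
# observed objects and candidate locations, outputs a probability distribution over
# those locations for a queried (unseen) semantic object class.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbor features, then project."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetrically normalize the adjacency (with self-loops) and propagate.
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        return F.relu(self.linear(norm_adj @ x))


class CECILikeModel(nn.Module):
    """Scores candidate locations for an unseen object class (illustrative only)."""

    def __init__(self, node_dim: int, class_dim: int, hidden: int = 64):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(node_dim, hidden)
        self.gcn2 = SimpleGCNLayer(hidden, hidden)
        self.scorer = nn.Linear(hidden + class_dim, 1)

    def forward(self, node_feats, adj, query_class_emb, candidate_mask):
        h = self.gcn2(self.gcn1(node_feats, adj), adj)
        # Pair every node embedding with the embedding of the queried object class.
        q = query_class_emb.expand(h.size(0), -1)
        scores = self.scorer(torch.cat([h, q], dim=-1)).squeeze(-1)
        # Restrict the distribution to candidate-location nodes only.
        scores = scores.masked_fill(~candidate_mask, float("-inf"))
        return F.softmax(scores, dim=0)  # distribution over candidate locations


# Toy usage: 6 graph nodes (observed objects + candidate locations), 16-d features.
if __name__ == "__main__":
    model = CECILikeModel(node_dim=16, class_dim=8)
    node_feats = torch.randn(6, 16)
    adj = (torch.rand(6, 6) > 0.5).float()
    adj = ((adj + adj.t()) > 0).float()            # make the graph undirected
    query = torch.randn(1, 8)                      # embedding of the unseen class
    mask = torch.tensor([False, False, False, True, True, True])
    print(model(node_feats, adj, query, mask))     # sums to 1 over candidates
```

Training such a model on the generated dataset described in the paper would then amount to minimizing a divergence (e.g., cross-entropy) between the predicted distribution and the ground-truth placement distribution; the neuro-symbolic variant would additionally bias or constrain the prediction with an LLM-derived spatial ontology.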
Related papers
- Exploiting Object-based and Segmentation-based Semantic Features for Deep Learning-based Indoor Scene Classification [0.5572976467442564]
The work described in this paper uses semantic information obtained from both object detection and semantic segmentation techniques.
A novel approach is proposed that uses a semantic segmentation mask to provide a Hu-moments-based shape characterization of the segmentation categories, designated by Hu-Moments Features (SHMFs).
A three-main-branch network, designated by GOS$^2$F$^2$App, that exploits deep-learning-based global features, object-based features, and semantic-segmentation-based features, is also proposed.
arXiv Detail & Related papers (2024-04-11T13:37:51Z)
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
- Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition [5.083140094792973]
SpaCoNet simultaneously models Spatial relation and Co-occurrence of objects guided by semantic segmentation.
Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:04:22Z)
- Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them.
Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint.
Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z)
- Spatial Reasoning for Few-Shot Object Detection [21.3564383157159]
We propose a spatial reasoning framework that detects novel objects with only a few training examples in a context.
We employ a graph convolutional network in which the RoIs and their relatedness are defined as nodes and edges, respectively.
We demonstrate that the proposed method significantly outperforms the state-of-the-art methods and verify its efficacy through extensive ablation studies.
arXiv Detail & Related papers (2022-11-02T12:38:08Z)
- Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video).
Existing approaches to scene graph generation assume a certain factorization of the joint distribution to make the estimation feasible.
We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
arXiv Detail & Related papers (2022-07-27T10:37:29Z)
- Cross-Modal Discrete Representation Learning [73.68393416984618]
We present a self-supervised learning framework that learns a representation that captures finer levels of granularity across different modalities.
Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities.
arXiv Detail & Related papers (2021-06-10T00:23:33Z)
- SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel spatial relation induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
- Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells [11.071527762096053]
We propose a representation learning model called Space2Vec to encode the absolute positions and spatial relationships of places.
Results show that because of its multi-scale representations, Space2Vec outperforms well-established ML approaches.
arXiv Detail & Related papers (2020-02-16T04:22:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.