VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
- URL: http://arxiv.org/abs/2602.00637v1
- Date: Sat, 31 Jan 2026 10:11:27 GMT
- Title: VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
- Authors: Vivek Madhavaram, Vartika Sengar, Arkadipta De, Charu Sharma
- Abstract summary: We propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. It infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data.
- Score: 1.9190955990713918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from a specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like "left/right", which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object's front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR in scene graph generation and downstream tasks, such as query-based object grounding. VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.
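The abstract's central idea, that "left/right" is defined relative to an object's own front-facing direction and not relative to a camera, can be illustrated with a small sketch. This is not VIZOR's implementation: the function name `object_centric_relation`, the z-up right-handed coordinate convention, the four relation labels, and the dominant-axis rule are all assumptions made for illustration only.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Right-handed cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def object_centric_relation(
    anchor_pos: Vec3,
    anchor_front: Vec3,
    target_pos: Vec3,
    up: Vec3 = (0.0, 0.0, 1.0),  # assumed convention: z is up
) -> str:
    """Describe where the target lies relative to the anchor's own front direction.

    Hypothetical helper: no camera pose is an input, so the label cannot
    change when the observer's viewpoint changes.
    """
    right = cross(anchor_front, up)       # anchor's right-hand side
    offset = sub(target_pos, anchor_pos)  # vector from anchor to target
    forward_amount = dot(offset, anchor_front)
    right_amount = dot(offset, right)
    # Assumed rule: pick the axis along which the offset is largest.
    if abs(forward_amount) >= abs(right_amount):
        return "in front of" if forward_amount >= 0 else "behind"
    return "right of" if right_amount > 0 else "left of"


def rotate_z_90(v: Vec3) -> Vec3:
    """Rotate a vector 90 degrees counter-clockwise about the z axis."""
    return (-v[1], v[0], v[2])


if __name__ == "__main__":
    chair_pos, chair_front = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    lamp_pos = (0.0, 1.0, 0.0)
    print(object_centric_relation(chair_pos, chair_front, lamp_pos))
    # Rotating the whole scene (positions and front direction) keeps the label.
    print(object_centric_relation(
        rotate_z_90(chair_pos), rotate_z_90(chair_front), rotate_z_90(lamp_pos)
    ))
```

Both calls report the lamp as "left of" the chair: because the relation is computed in the chair's own frame, re-expressing the scene in a rotated world frame does not change the answer, whereas a camera-relative "left/right" would flip as the observer moves.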
Related papers
- SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences [12.771171646896468]
We introduce SceneLinker, a framework that generates compositional 3D scenes via semantic scene graph from RGB sequences. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial Mixed Reality (MR) content.
arXiv Detail & Related papers (2026-02-03T01:22:07Z)
- RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming [79.81527946524098]
RoamScene3D is a novel framework that bridges the gap between semantic guidance and spatial generation. We employ a vision-language model (VLM) to construct a scene graph that encodes object relations. To mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset.
arXiv Detail & Related papers (2026-01-27T10:10:55Z)
- ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models [0.0]
ZING-3D is a framework that generates a rich semantic representation of a 3D scene in a zero-shot manner. It also enables incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our experiments on scenes from the Replica and HM3D dataset show that ZING-3D is effective at capturing spatial and relational knowledge without the need of task-specific training.
arXiv Detail & Related papers (2025-10-24T00:52:33Z)
- 3D scene generation from scene graphs and self-attention [51.49886604454926]
We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from scene graphs and floor plans.
We exploit the properties of self-attention layers to capture high-level relationships between objects in a scene.
arXiv Detail & Related papers (2024-04-02T12:26:17Z)
- 3D Scene Diffusion Guidance using Scene Graphs [3.207455883863626]
We propose a novel approach for 3D scene diffusion guidance using scene graphs.
To leverage the relative spatial information the scene graphs provide, we make use of relational graph convolutional blocks within our denoising network.
arXiv Detail & Related papers (2023-08-08T06:16:37Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z)
- Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions [94.17683799712397]
We focus on scene graphs, a data structure that organizes the entities of a scene in a graph.
We propose a learned method that regresses a scene graph from the point cloud of a scene.
We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
arXiv Detail & Related papers (2020-04-08T12:25:25Z)