ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models
- URL: http://arxiv.org/abs/2510.21069v1
- Date: Fri, 24 Oct 2025 00:52:33 GMT
- Title: ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models
- Authors: Pranav Saxena, Jimmy Chiun,
- Abstract summary: ZING-3D is a framework that generates a rich semantic representation of a 3D scene in a zero-shot manner. It also enables incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive, and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner, while also enabling incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach leverages VLM reasoning to generate a rich 2D scene graph, which is grounded in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context, while edges capture spatial and semantic relations with inter-object distances. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
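The grounding step described in the abstract (a 2D detection lifted into 3D with depth, nodes carrying 3D locations, edges carrying inter-object distances, and updates as new observations arrive) can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole camera model, the running-mean merge, matching objects by label, and every name below (`backproject`, `Node`, `SceneGraph`, `add_observation`, `add_edge`) are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); the abstract does not state the camera model.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


@dataclass
class Node:
    label: str            # open-vocabulary object label
    position: tuple       # (x, y, z) location in metres
    observations: int = 1


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)

    def add_observation(self, label, position):
        """Incremental update: create a node or merge into an existing one.

        Hypothetical merge rule: objects are matched by label and their
        positions are averaged with a running mean.
        """
        node = self.nodes.get(label)
        if node is None:
            self.nodes[label] = Node(label, tuple(position))
            return
        node.observations += 1
        n = node.observations
        node.position = tuple(
            old + (new - old) / n for old, new in zip(node.position, position)
        )
        # Refresh distances on edges that touch the updated node.
        for (a, b), edge in self.edges.items():
            if label in (a, b):
                edge["distance"] = math.dist(
                    self.nodes[a].position, self.nodes[b].position
                )

    def add_edge(self, a, b, relation):
        """Store a relation between two nodes with their Euclidean distance."""
        self.edges[(a, b)] = {
            "relation": relation,
            "distance": math.dist(self.nodes[a].position, self.nodes[b].position),
        }


graph = SceneGraph()
# A pixel at the principal point, 2 m away, lands on the optical axis.
graph.add_observation("chair", backproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0))
graph.add_observation("table", (3.0, 4.0, 2.0))
graph.add_edge("chair", "table", "next to")
print(graph.edges[("chair", "table")])
```

Matching by label keeps the sketch short but would conflate two chairs in the same room; a real system would need some form of instance association, which the abstract does not detail.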
Related papers
- Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model.
Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry.
We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z)
- NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation [14.046423852723615]
We introduce a novel 3D Gaussian Splatting based hard visual prompting approach to generate diverse viewpoints around target objects.
Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts.
This training-free strategy integrates seamlessly with prior hard visual prompts, enriching object-descriptive features.
arXiv Detail & Related papers (2025-04-20T14:39:27Z)
- 3D Scene Graph Guided Vision-Language Pre-training [11.131667398927394]
3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions.
Existing approaches typically follow task-specific, highly specialized paradigms.
This paper proposes a 3D scene graph-guided vision-language pre-training framework.
arXiv Detail & Related papers (2024-11-27T16:10:44Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z)
- Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment [24.63428589906294]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models on aligning the semantics between texts and 2D images.
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn the transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)