HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition
- URL: http://arxiv.org/abs/2510.27148v1
- Date: Fri, 31 Oct 2025 03:50:47 GMT
- Title: HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition
- Authors: Jiacheng Hong, Kunzhen Wu, Mingrui Yu, Yichao Gu, Shengze Xue, Shuangjiu Xiao, Deli Dong,
- Abstract summary: We propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph.
- Score: 1.9131307324613616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.
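The PHiSSG described above maintains a one-to-one mapping between graph nodes and generated objects and supports iterative expansion plus recursive layout optimization. As a rough illustration of that data structure, here is a minimal sketch; all class and method names (`PHiSSG`, `SceneNode`, `expand`, `optimize_layout`) and the toy layout rule are hypothetical assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """One graph node per generated object (the one-to-one mapping)."""
    name: str
    position: tuple  # (x, y, z) placeholder layout coordinate
    children: list = field(default_factory=list)   # hierarchical containment
    relations: dict = field(default_factory=dict)  # semantic dependencies, e.g. {"on": "sofa"}

class PHiSSG:
    """Hypothetical sketch of a progressive hierarchical spatial-semantic graph."""

    def __init__(self, root_name: str):
        self.root = SceneNode(root_name, (0.0, 0.0, 0.0))
        self.index = {root_name: self.root}  # enforces one node per object

    def expand(self, parent_name: str, obj_name: str, position, relations=None):
        """Multi-step expansion: attach a new object under a selected key object."""
        if obj_name in self.index:
            raise ValueError(f"{obj_name} already exists; mapping must stay one-to-one")
        node = SceneNode(obj_name, tuple(position), relations=relations or {})
        self.index[parent_name].children.append(node)
        self.index[obj_name] = node
        return node

    def optimize_layout(self, node=None):
        """Recursive layout pass. The real method optimizes spatial consistency;
        this stand-in just clamps each child within unit distance of its parent."""
        node = node or self.root
        for child in node.children:
            child.position = tuple(
                p + max(-1.0, min(1.0, c - p))
                for p, c in zip(node.position, child.position)
            )
            self.optimize_layout(child)

# Toy usage: expand a scene from a root region, then recursively fix the layout.
graph = PHiSSG("living_room")
graph.expand("living_room", "sofa", (0.5, 0.0, 0.0))
graph.expand("sofa", "cushion", (3.0, 0.0, 0.0), relations={"on": "sofa"})
graph.optimize_layout()
```

The `index` dictionary is what makes the node-object mapping explicit: every generated object is reachable by name, so a user can select any existing object as the anchor for the next expansion step.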
Related papers
- Asset-Driven Semantic Reconstruction of Dynamic Scene with Multi-Human-Object Interactions [41.29588736908775]
3D geometry modeling of dynamic scenes is crucial for applications like AR/VR, gaming, and embodied AI. We propose a hybrid approach that combines the advantages of 1) 3D generative models for generating high-fidelity meshes of the scene elements, 2) semantic-aware deformation, and 3) GS-based optimization of the individual elements. Our method outperforms the state-of-the-art method in producing better surface reconstruction of such scenes.
arXiv Detail & Related papers (2025-11-29T16:36:22Z) - KeySG: Hierarchical Keyframe-Based 3D Scene Graphs [1.5134439544218246]
KeySG represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements. We leverage a VLM to extract scene information, alleviating the need to explicitly model relationship edges between objects. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs.
arXiv Detail & Related papers (2025-10-01T15:53:27Z) - Graph-Guided Dual-Level Augmentation for 3D Scene Segmentation [21.553363236403822]
3D point cloud segmentation aims to assign semantic labels to individual points in a scene for fine-grained spatial understanding. Existing methods typically adopt data augmentation to alleviate the burden of large-scale annotation. We propose a graph-guided data augmentation framework with dual-level constraints for realistic 3D scene synthesis.
arXiv Detail & Related papers (2025-07-30T13:25:36Z) - X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability [49.4647778989539]
X-Scene is a novel framework for large-scale driving scene generation. It achieves both geometric intricacy and appearance fidelity, while offering flexible controllability. X-Scene significantly advances controllability and fidelity for large-scale driving scene generation.
arXiv Detail & Related papers (2025-06-16T14:43:18Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - Universal Scene Graph Generation [77.53076485727414]
We present Universal SG (USG), a novel representation capable of characterizing comprehensive semantic scenes. We also introduce USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain challenges.
arXiv Detail & Related papers (2025-03-19T08:55:06Z) - Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong-performing video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.