I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
- URL: http://arxiv.org/abs/2512.13683v1
- Date: Mon, 15 Dec 2025 18:59:13 GMT
- Title: I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
- Authors: Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera
- Abstract summary: Generalization remains the central challenge for interactive 3D scene generation. We reprogram a pre-trained 3D instance generator to act as a scene-level learner. We show that spatial reasoning still emerges even when the training scenes are randomly composed objects.
- Score: 21.18471823625016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/
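To make the view-centric formulation concrete, here is a minimal NumPy sketch (not the authors' code) of the core idea: rather than normalizing each object into its own canonical frame, every instance is expressed in one shared camera frame, so relative layout is baked into the geometry the instance model sees. All function names, shapes, and the random scene composition below are hypothetical illustrations.

```python
# Minimal sketch of a view-centric scene space: all instances share one
# camera (view) frame instead of per-object canonical frames, so a scene
# is a set of instance geometries that already encode relative layout.
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a world-to-view rotation/translation from a camera pose."""
    fwd = target - eye
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, fwd)
    R = np.stack([right, true_up, -fwd])  # rows are the view axes
    t = -R @ eye
    return R, t

def to_view_space(points_world, R, t):
    """Map Nx3 world-space points into the shared view frame."""
    return points_world @ R.T + t

# Randomly composed "scene": the paper observes that spatial reasoning
# emerges even from such random layouts, because supervision comes from
# the pre-trained instance model, not from curated scene datasets.
rng = np.random.default_rng(0)
objects = [rng.normal(size=(256, 3)) * 0.2 + rng.uniform(-1, 1, size=3)
           for _ in range(4)]

R, t = look_at(eye=np.array([2.0, 1.5, 2.0]), target=np.zeros(3))
scene_view = [to_view_space(obj, R, t) for obj in objects]
# Each instance now carries scene layout implicitly via the shared frame.
print(scene_view[0].shape)  # (256, 3)
```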
Related papers
- Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new world-model paradigm that simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
arXiv Detail & Related papers (2026-03-03T19:58:31Z) - TRELLISWorld: Training-Free World Generation from Object Generators [13.962895984556582]
Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. Existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. We present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators.
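A hedged sketch of this tile-based idea: a general-purpose object generator is applied per overlapping tile and the results are blended, with no scene-level training. `generate_object` below is a placeholder standing in for any off-the-shelf text-to-3D model; the one-axis layout and occupancy-grid output are illustrative assumptions.

```python
# Training-free tiling sketch: generate one occupancy grid per prompt,
# lay the tiles along one axis, and linearly blend the overlap regions.
import numpy as np

def generate_object(prompt: str, resolution: int = 32) -> np.ndarray:
    """Placeholder for a pre-trained text-to-3D generator that returns
    an occupancy grid for one tile."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return (rng.random((resolution,) * 3) > 0.7).astype(np.float32)

def blend_tiles(prompts, res=32, overlap=8):
    """Weighted average of overlapping tiles along the first axis."""
    stride = res - overlap
    width = stride * (len(prompts) - 1) + res
    scene = np.zeros((width, res, res))
    weight = np.zeros_like(scene)
    ramp = np.ones(res)
    ramp[:overlap] = np.linspace(0, 1, overlap)   # fade-in
    ramp[-overlap:] = np.linspace(1, 0, overlap)  # fade-out
    w = ramp[:, None, None]
    for i, prompt in enumerate(prompts):
        s = i * stride
        scene[s:s + res] += generate_object(prompt, res) * w
        weight[s:s + res] += w
    return scene / np.clip(weight, 1e-6, None)

scene = blend_tiles(["a sofa", "a coffee table", "a bookshelf"])
print(scene.shape)  # (80, 32, 32) for 3 tiles, res 32, overlap 8
```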
arXiv Detail & Related papers (2025-10-27T21:40:31Z) - IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions. We propose Instance-Grounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z) - ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models [0.0]
ZING-3D is a framework that generates a rich semantic representation of a 3D scene in a zero-shot manner. It also enables incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
arXiv Detail & Related papers (2025-10-24T00:52:33Z) - GenSpace: Benchmarking Spatially-Aware Image Generation [76.98817635685278]
Humans intuitively compose and arrange scenes in 3D space for photography. Can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to assess the spatial awareness of current image generation models.
arXiv Detail & Related papers (2025-05-30T17:59:26Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
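The read/update loop described above can be sketched as follows. This is a hypothetical illustration, not the paper's API: `query_vlm` stands in for the actual vision-language model call, and the structured `SpatialContext` is an assumed representation of the scene state.

```python
# Agentic loop sketch: the model reads a structured spatial context,
# proposes an edit, and the context is updated before the next step.
from dataclasses import dataclass, field

@dataclass
class SpatialContext:
    """Scene state the agent reads from and writes to."""
    objects: dict = field(default_factory=dict)  # name -> (x, y, z)

def query_vlm(instruction: str, context: SpatialContext) -> dict:
    """Placeholder for a VLM call; here it just proposes placing the
    named object at a fixed offset from existing objects."""
    name = instruction.split()[-1]
    return {"action": "place", "name": name,
            "position": (len(context.objects) * 1.5, 0.0, 0.0)}

def agent_step(instruction: str, context: SpatialContext) -> SpatialContext:
    """One iteration: read context, query the model, update context."""
    proposal = query_vlm(instruction, context)
    if proposal["action"] == "place":
        context.objects[proposal["name"]] = proposal["position"]
    return context

ctx = SpatialContext()
for step in ["add a table", "add a lamp", "add a chair"]:
    ctx = agent_step(step, ctx)
print(ctx.objects)  # three placed objects with increasing x offsets
```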
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization [31.52569918586902]
3D scene synthesis has diverse applications across a spectrum of industries such as robotics, films, and video games.
In this paper, we aim to generate realistic and reasonable 3D indoor scenes from scene graphs.
Our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity.
arXiv Detail & Related papers (2024-03-19T15:54:48Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
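A minimal sketch of such a two-branch design, assuming illustrative dimensions: one branch is a variational auto-encoder over per-object layout (boxes), while the other maps the shared latent to a shape code that a diffusion model would decode into geometry (omitted here). Module names and sizes are hypothetical, not the paper's.

```python
# Two-branch sketch: a layout VAE plus a shape-code head over one latent.
import torch
import torch.nn as nn

class TwoBranchSceneModel(nn.Module):
    def __init__(self, node_dim=32, latent_dim=16, shape_dim=64):
        super().__init__()
        # Layout branch: VAE over box parameters (center + size + yaw = 7).
        self.enc = nn.Linear(node_dim, 2 * latent_dim)
        self.layout_dec = nn.Linear(latent_dim, 7)
        # Shape branch: shape code consumed by a (omitted) diffusion decoder.
        self.shape_dec = nn.Linear(latent_dim, shape_dim)

    def forward(self, node_feats):  # (num_objects, node_dim) graph features
        mu, logvar = self.enc(node_feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparam.
        boxes = self.layout_dec(z)        # (num_objects, 7)
        shape_codes = self.shape_dec(z)   # (num_objects, shape_dim)
        return boxes, shape_codes, mu, logvar

model = TwoBranchSceneModel()
nodes = torch.randn(5, 32)  # 5 object nodes from a scene graph encoder
boxes, codes, mu, logvar = model(nodes)
print(boxes.shape, codes.shape)  # torch.Size([5, 7]) torch.Size([5, 64])
```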
arXiv Detail & Related papers (2023-05-25T17:39:13Z) - Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
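A hedged sketch in the spirit of that slot-based decoding: each ray softly attends over a small set of object slots and the mixed feature is decoded to a color. All shapes and the random projections are illustrative stand-ins for learned components, not OSRT's actual decoder.

```python
# Slot-mixing sketch: rays attend over object slots, then a lightweight
# head decodes the mixed per-ray feature into RGB.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_slots, slot_dim, num_rays = 6, 32, 1024
slots = rng.normal(size=(num_slots, slot_dim))       # object-centric latents
ray_queries = rng.normal(size=(num_rays, slot_dim))  # encoded camera rays

# Each ray softly selects which slots (objects) it sees ...
attn = softmax(ray_queries @ slots.T / np.sqrt(slot_dim))  # (rays, slots)
mixed = attn @ slots                                 # (rays, slot_dim)

# ... and a random projection stands in for the learned decoder MLP.
W = rng.normal(size=(slot_dim, 3))
rgb = 1 / (1 + np.exp(-(mixed @ W)))                 # (rays, 3) in [0, 1]
print(rgb.shape)  # (1024, 3)
```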
arXiv Detail & Related papers (2022-06-14T15:40:47Z)