NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation
- URL: http://arxiv.org/abs/2601.19048v1
- Date: Tue, 27 Jan 2026 00:04:02 GMT
- Title: NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation
- Authors: Han-Hung Lee, Cheng-Yu Yang, Yu-Lun Liu, Angel X. Chang
- Abstract summary: We present NuiWorld, a framework that attempts to address the challenges of world generation. We synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Our framework enables controllability through pseudo sketch labels, and demonstrates a degree of generalization to previously unseen sketches.
- Score: 19.342379757491177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity, while object-centric generation approaches rely on fixed-resolution representations that degrade fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo sketch labels, and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.
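To make the chunk-to-vector-set idea concrete, here is a minimal sketch of how variable-size scene chunks could be pooled to fixed-length token sets, assuming a Perceiver-style cross-attention pooling; the module, dimensions, and chunk featurization are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the chunked "vector-set" idea: each variable-size
# scene chunk is pooled to a fixed number of latent vectors via cross-attention,
# so token length grows with the number of chunks, not with raw point count.
# All module names and sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class ChunkVectorSetEncoder(nn.Module):
    def __init__(self, feat_dim=64, num_latents=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, chunk_feats):  # (num_points, feat_dim), num_points varies
        q = self.latents.unsqueeze(0)   # (1, num_latents, feat_dim) queries
        kv = chunk_feats.unsqueeze(0)   # (1, num_points, feat_dim) keys/values
        out, _ = self.attn(q, kv, kv)   # pool the chunk into a fixed-size set
        return out.squeeze(0)           # (num_latents, feat_dim)

encoder = ChunkVectorSetEncoder()
# Chunks of very different sizes compress to the same number of tokens each.
chunks = [torch.randn(n, 64) for n in (500, 8000, 120)]
tokens = torch.cat([encoder(c) for c in chunks], dim=0)  # flattened vector set
print(tokens.shape)  # (3 * 16, 64): token count scales with chunk count
```

The point of this design is that token count scales with the number of chunks rather than with raw scene extent, which is what would keep training and inference cost manageable as scenes grow.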
Related papers
- TRELLISWorld: Training-Free World Generation from Object Generators [13.962895984556582]
Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation.
Existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability.
We present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators.
arXiv Detail & Related papers (2025-10-27T21:40:31Z) - WorldGrow: Generating Infinite 3D World [75.81531067447203]
- WorldGrow: Generating Infinite 3D World [75.81531067447203]
We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance.
We propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis.
Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity.
arXiv Detail & Related papers (2025-10-24T17:39:52Z) - Graph-Guided Dual-Level Augmentation for 3D Scene Segmentation [21.553363236403822]
- Graph-Guided Dual-Level Augmentation for 3D Scene Segmentation [21.553363236403822]
3D point cloud segmentation aims to assign semantic labels to individual points in a scene for fine-grained spatial understanding.
Existing methods typically adopt data augmentation to alleviate the burden of large-scale annotation.
We propose a graph-guided data augmentation framework with dual-level constraints for realistic 3D scene synthesis.
arXiv Detail & Related papers (2025-07-30T13:25:36Z) - BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation [54.12899218104669]
Generating 3D scenes is challenging because scenes have highly complex structures and the output must be dense, coherent, and contain all necessary structures.
Current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators.
We propose BloomScene, a lightweight structured 3D Gaussian splatting framework for crossmodal scene generation.
arXiv Detail & Related papers (2025-01-15T11:33:34Z) - MegaScenes: Scene-Level View Synthesis at Scale [69.21293001231993]
Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications.
We create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world.
We analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency.
arXiv Detail & Related papers (2024-06-17T17:55:55Z) - Self-supervised novel 2D view synthesis of large-scale scenes with
efficient multi-scale voxel carving [77.07589573960436]
We introduce an efficient multi-scale voxel carving method to generate novel views of real scenes.
Our final high-resolution output is efficiently self-trained on data automatically generated by the voxel carving module.
We demonstrate the effectiveness of our method on highly complex and large-scale scenes in real environments.
arXiv Detail & Related papers (2023-06-26T13:57:05Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z) - Future Urban Scenes Generation Through Vehicles Synthesis [90.1731992199415]
- Future Urban Scenes Generation Through Vehicles Synthesis [90.1731992199415]
We propose a deep learning pipeline to predict the visual future appearance of an urban scene.
We follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently.
We show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow.
arXiv Detail & Related papers (2020-07-01T08:40:16Z)