ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing
- URL: http://arxiv.org/abs/2509.23607v1
- Date: Sun, 28 Sep 2025 03:21:12 GMT
- Title: ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing
- Authors: Xiang Tang, Ruotong Li, Xiaopeng Fan,
- Abstract summary: We propose a novel system to accomplish both single image-to-3D scene reconstruction and texture editing in a zero-shot manner.<n>ZeroScene extracts object-level 2D segmentation and depth information from input images to infer spatial relationships within the scene.<n>It then jointly optimize 3D and 2D projection losses of the point cloud to update object poses for precise scene alignment.
- Score: 36.098009720325436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of 3D content generation, single image scene reconstruction methods still struggle to simultaneously ensure the quality of individual assets and the coherence of the overall scene in complex environments, while texture editing techniques often fail to maintain both local continuity and multi-view consistency. In this paper, we propose a novel system ZeroScene, which leverages the prior knowledge of large vision models to accomplish both single image-to-3D scene reconstruction and texture editing in a zero-shot manner. ZeroScene extracts object-level 2D segmentation and depth information from input images to infer spatial relationships within the scene. It then jointly optimizes 3D and 2D projection losses of the point cloud to update object poses for precise scene alignment, ultimately constructing a coherent and complete 3D scene that encompasses both foreground and background. Moreover, ZeroScene supports texture editing of objects in the scene. By imposing constraints on the diffusion model and introducing a mask-guided progressive image generation strategy, we effectively maintain texture consistency across multiple viewpoints and further enhance the realism of rendered results through Physically Based Rendering (PBR) material estimation. Experimental results demonstrate that our framework not only ensures the geometric and appearance accuracy of generated assets, but also faithfully reconstructs scene layouts and produces highly detailed textures that closely align with text prompts.
Related papers
- 3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework [7.4570191712029965]
3D-RE-GEN is a compositional framework that reconstructs a single image into textured 3D objects and a background.<n>Our pipeline integrates models for asset detection, reconstruction, and placement, pushing certain models beyond their originally intended domains.<n>Unlike current methods, 3D-RE-GEN generates a comprehensive background that spatially constrains objects during optimization.
arXiv Detail & Related papers (2025-12-19T11:20:52Z) - Constructing a 3D Town from a Single Image [23.231661811526955]
3DTown is a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view.<n>We decompose the input image into overlapping regions and generate each using a pretrained 3D object generator.<n>Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
arXiv Detail & Related papers (2025-05-21T17:10:47Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.<n>We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image [44.8172828045897]
Current methods often struggle with domain-specific limitations or low-quality object generation.<n>We propose CAST, a novel method for 3D scene reconstruction and recovery.
arXiv Detail & Related papers (2025-02-18T14:29:52Z) - BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation [54.12899218104669]
3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures.<n>Current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators.<n>We propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation.
arXiv Detail & Related papers (2025-01-15T11:33:34Z) - Zero-Shot Multi-Object Scene Completion [59.325611678171974]
We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image.
Our method outperforms the current state-of-the-art on both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-03-21T17:59:59Z) - SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z) - Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture [47.44029968307207]
We propose a novel framework for simultaneous high-fidelity recovery of object shapes and textures from single-view images.
Our approach utilizes the proposed Single-view neural implicit Shape and Radiance field (SSR) representations to leverage both explicit 3D shape supervision and volume rendering.
A distinctive feature of our framework is its ability to generate fine-grained textured meshes while seamlessly integrating rendering capabilities into the single-view 3D reconstruction model.
arXiv Detail & Related papers (2023-11-01T11:46:15Z) - CompNVS: Novel View Synthesis with Scene Completion [83.19663671794596]
We propose a generative pipeline performing on a sparse grid-based neural scene representation to complete unobserved scene parts.
We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to extrapolate the missing area.
Photorealistic image sequences can be finally obtained via consistency-relevant differentiable rendering.
arXiv Detail & Related papers (2022-07-23T09:03:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.