GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance
- URL: http://arxiv.org/abs/2601.19785v2
- Date: Wed, 28 Jan 2026 16:02:47 GMT
- Title: GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance
- Authors: Haozhi Zhu, Miaomiao Zhao, Dingyao Liu, Runze Tian, Yan Zhang, Jie Guo, Fenggen Yu,
- Abstract summary: 3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Existing methods broadly follow two paradigms - indirect 2D-to-3D reconstruction and direct 3D generation. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images.
- Score: 8.625308061265754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms - indirect 2D-to-3D reconstruction and direct 3D generation - but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.
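The abstract does not spell out how "voxel-aligned 3D feature aggregation" works, but the usual recipe for this kind of aggregation is to project voxel centers into each reference view, sample 2D features at the projections, and pool across views so that view-to-view inconsistency averages out. The sketch below illustrates that generic recipe in PyTorch; the function name, tensor shapes, and mean pooling are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def aggregate_voxel_features(voxel_xyz, feat_maps, K, w2c):
    """Pool 2D reference-view features onto voxel centers.

    voxel_xyz: (V, 3) voxel centers in world coordinates
    feat_maps: (N, C, H, W) per-view 2D feature maps
    K:         (N, 3, 3) camera intrinsics
    w2c:       (N, 4, 4) world-to-camera extrinsics
    returns:   (V, C) cross-view-averaged voxel features
    """
    N, C, H, W = feat_maps.shape
    homo = torch.cat([voxel_xyz, torch.ones_like(voxel_xyz[:, :1])], dim=1)
    cam = torch.einsum('nij,vj->nvi', w2c, homo)[..., :3]       # (N, V, 3)
    pix = torch.einsum('nij,nvj->nvi', K, cam)                  # (N, V, 3)
    uv = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)            # (N, V, 2)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[..., 0] / (W - 1),
                        uv[..., 1] / (H - 1)], dim=-1) * 2 - 1  # (N, V, 2)
    sampled = F.grid_sample(feat_maps, grid.unsqueeze(2),
                            align_corners=True).squeeze(-1)     # (N, C, V)
    # Keep only views where the voxel is in front of the camera and in frame.
    valid = (cam[..., 2] > 0) & (grid.abs() <= 1).all(dim=-1)   # (N, V)
    w = valid.float().unsqueeze(1)                              # (N, 1, V)
    # Averaging over views dampens inconsistency among diffusion references.
    return (sampled * w).sum(0).T / w.sum(0).T.clamp(min=1.0)   # (V, C)
```

On this reading, the validity mask and cross-view averaging are what would make the aggregation tolerant of the noisy, mutually inconsistent guidance the abstract mentions.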
Related papers
- Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement [12.855027334688382]
Photo3D is a framework for advancing photorealistic 3D generation driven by image data from the GPT-4o-Image model. We present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency. Our scheme is general across different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms.
arXiv Detail & Related papers (2025-12-09T12:33:48Z)
- LATTICE: Democratize High-Fidelity 3D Generation at Scale [27.310104395842075]
LATTICE is a new framework for high-fidelity 3D asset generation. VoxSet is a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid. Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes.
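The VoxSet description suggests a simple data structure: a sparse set of coarse voxel anchors, each carrying a latent vector, which a decoder can query at any resolution. Below is a minimal, hypothetical illustration of such a representation; the names and sizes are invented for the example, not taken from LATTICE.

```python
from dataclasses import dataclass
import torch

@dataclass
class VoxelAnchoredLatents:
    """A semi-structured latent set: one latent per occupied coarse voxel.

    anchors:  (M, 3) integer coordinates of occupied cells in a coarse grid
    latents:  (M, D) one latent vector per anchor
    grid_res: side length of the coarse voxel grid
    """
    anchors: torch.Tensor
    latents: torch.Tensor
    grid_res: int

    def anchor_positions(self) -> torch.Tensor:
        # Map integer cells to voxel-center positions in [-1, 1]^3,
        # so a decoder can query the latents at arbitrary resolution.
        return (self.anchors.float() + 0.5) / self.grid_res * 2 - 1

# Illustrative sizes only: 512 occupied cells of a 16^3 grid, 64-d latents.
rep = VoxelAnchoredLatents(anchors=torch.randint(0, 16, (512, 3)),
                           latents=torch.randn(512, 64), grid_res=16)
print(rep.anchor_positions().shape)  # torch.Size([512, 3])
```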
arXiv Detail & Related papers (2025-11-24T03:22:19Z)
- IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions. We propose the Instance-Grounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z)
- WorldGrow: Generating Infinite 3D World [75.81531067447203]
We tackle the challenge of generating an infinitely extendable 3D world - large, continuous environments with coherent geometry and realistic appearance. We propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity.
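Component (2), block inpainting for context-aware extension, typically amounts to conditioning each new block on overlapping slices of the scene generated so far. The sketch below shows that sliding-window pattern under assumptions of our own: a dense feature grid, growth along one axis, and an `inpaint_fn` stub standing in for the actual generative model.

```python
import torch

def grow_scene(scene, block, overlap, inpaint_fn):
    """Extend a 3D scene one block at a time along +x.

    scene:      (X, Y, Z, C) latent/feature grid of the scene so far
    block:      side length of each new block along x
    overlap:    how many existing x-slices condition the new block
    inpaint_fn: model that fills (overlap + block) slices given the first
                `overlap` slices as known context (assumed interface)
    """
    context = scene[-overlap:]                       # known x-slices
    filled = inpaint_fn(context, total=overlap + block)
    new_block = filled[overlap:]                     # keep only the new part
    return torch.cat([scene, new_block], dim=0)

# Stub inpainter for illustration: tiles the context mean into new slices.
def stub_inpaint(context, total):
    pad = context.mean(dim=0, keepdim=True).repeat(total - len(context), 1, 1, 1)
    return torch.cat([context, pad], dim=0)

scene = torch.randn(8, 4, 4, 16)                     # tiny 8x4x4 grid, C=16
scene = grow_scene(scene, block=4, overlap=2, inpaint_fn=stub_inpaint)
print(scene.shape)  # torch.Size([12, 4, 4, 16])
```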
arXiv Detail & Related papers (2025-10-24T17:39:52Z)
- Constructing a 3D Scene from a Single Image [31.11317559252235]
SceneFuse-3D is a training-free framework designed to synthesize coherent 3D scenes from a single top-down view. We decompose the input image into overlapping regions and generate each using a pretrained 3D object generator. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning.
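The overlapping-region decomposition can be made concrete with a small tiling helper: enumerate tile origins with a stride smaller than the tile size so adjacent regions share context, then feed each tile to the pretrained 3D generator and place the resulting chunk at the tile's offset in scene coordinates. The tile size and stride below are illustrative, not SceneFuse-3D's actual settings.

```python
def tile_image(img_w, img_h, tile, stride):
    """Enumerate overlapping tile origins covering a top-down image.

    The overlap (tile - stride pixels) gives neighboring 3D chunks shared
    context, which helps them fuse into a seamless scene.
    """
    xs = range(0, max(img_w - tile, 0) + 1, stride)
    ys = range(0, max(img_h - tile, 0) + 1, stride)
    return [(x, y) for y in ys for x in xs]

# Example: a 1024x1024 top-down view, 512-px tiles, 50% overlap.
origins = tile_image(1024, 1024, tile=512, stride=256)
print(len(origins))  # 9, i.e. a 3x3 grid of overlapping tiles
```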
arXiv Detail & Related papers (2025-05-21T17:10:47Z)
- BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation [54.12899218104669]
3D scenes have highly complex structures, and generation must ensure that the output is dense, coherent, and contains all necessary structures. Current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. We propose BloomScene, a lightweight structured 3D Gaussian splatting framework for crossmodal scene generation.
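The reliance on monocular depth estimators hints at the standard lifting step behind such pipelines: unproject each pixel of an estimated depth map into 3D to seed per-pixel Gaussians. A generic pinhole-camera sketch follows; it is not BloomScene's code, and the intrinsics in the example are arbitrary.

```python
import torch

def unproject_depth(depth, K):
    """Lift a depth map to 3D points in camera space (pinhole model).

    Such points are a common way to initialize per-pixel 3D Gaussians
    from a monocular depth estimate.
    depth: (H, W) metric depth; K: (3, 3) camera intrinsics.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T   # per-pixel ray directions, z = 1
    return rays * depth.unsqueeze(-1)    # scale each ray by its depth

pts = unproject_depth(torch.rand(64, 64) + 1.0,
                      torch.tensor([[64., 0., 32.],
                                    [0., 64., 32.],
                                    [0., 0., 1.]]))
print(pts.shape)  # torch.Size([64, 64, 3])
```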
arXiv Detail & Related papers (2025-01-15T11:33:34Z)
- GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation [65.33726478659304]
We introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach that predicts high-quality assets with 512k Gaussians from 21 input images in only 11 GB of GPU memory.
Previous works neglect the inherent sparsity of 3D structure and do not exploit explicit geometric relationships between 3D structure and 2D images.
GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms.
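Deformable cross-attention between 3D points and 2D feature maps generally means each 3D query projects into an image and attends to a few learned offset locations around its projection rather than to every pixel. The module below is a generic sketch of that mechanism; the layer names, offset scale, and single-view interface are choices made for this example rather than details taken from GeoLRM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableImageCrossAttn(nn.Module):
    """Each 3D query samples image features at learned offsets around
    its 2D projection and mixes them with learned attention weights."""

    def __init__(self, dim, n_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, n_points * 2)   # per-query 2D offsets
        self.weights = nn.Linear(dim, n_points)       # per-sample logits
        self.n_points = n_points

    def forward(self, q, feat, uv):
        # q:    (V, dim) features of the 3D query points
        # feat: (1, C, H, W) one view's 2D feature map
        # uv:   (V, 2) projections of the queries into this view, in [-1, 1]
        off = self.offsets(q).view(-1, self.n_points, 2).tanh() * 0.1
        loc = (uv.unsqueeze(1) + off).clamp(-1, 1)    # (V, P, 2)
        s = F.grid_sample(feat, loc.unsqueeze(0), align_corners=True)
        # s: (1, C, V, P) -> weight and sum over the P sampled locations
        w = self.weights(q).softmax(dim=-1)           # (V, P)
        return torch.einsum('cvp,vp->vc', s[0], w)    # (V, C)
```

Sampling only a handful of offset locations per query keeps the attention cost linear in the number of 3D points, which is consistent with the memory efficiency the summary emphasizes.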
arXiv Detail & Related papers (2024-06-21T17:49:31Z)
- DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework that addresses the challenges of large-vocabulary 3D generation with a single model.
Building upon our 3D-aware Diffusion model with TransFormer, we propose a stronger version for 3D generation, i.e., DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z)
- 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new content with higher quality by exploiting the natural image prior of the 2D diffusion model together with the global 3D information of the current scene.
Our approach supports a wide variety of scene generation tasks and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z)