CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration
- URL: http://arxiv.org/abs/2512.00493v1
- Date: Sat, 29 Nov 2025 14:01:13 GMT
- Title: CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration
- Authors: Boshi Tang, Henry Zheng, Rui Huang, Gao Huang
- Abstract summary: High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. This paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation.
- Score: 29.052223430061826
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. Early approaches struggle to generalize due to reliance on specialized models trained on curated small datasets. While recent advancements in large-scale 3D foundation models have significantly enhanced instance-level generation, coherent scene generation remains a challenge, where performance is limited by inaccurate per-object pose estimations and spatial inconsistency. To this end, this paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation that jointly conforms to the object layout in the input image and preserves instance fidelity. CC-FMO employs a hybrid instance generator that combines semantics-aware vector-set representation with detail-rich structured latent representation, yielding object geometries that are both semantically plausible and high-quality. Furthermore, CC-FMO enables the application of foundational pose estimation models in the scene generation task via a simple yet effective camera-conditioned scale-solving algorithm, to enforce scene-level coherence. Extensive experiments demonstrate that CC-FMO consistently generates high-fidelity camera-aligned compositional scenes, outperforming all state-of-the-art methods.
Related papers
- FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation [50.71369329585773]
We introduce FACE, a novel Autoregressive Autoencoder framework that generates meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. FACE achieves state-of-the-art reconstruction quality on standard benchmarks.
arXiv Detail & Related papers (2026-03-02T06:47:15Z)
- TRELLISWorld: Training-Free World Generation from Object Generators [13.962895984556582]
Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. Existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. We present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators.
arXiv Detail & Related papers (2025-10-27T21:40:31Z)
- ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation [44.75113949778924]
ARTDECO is a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. We show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization.
arXiv Detail & Related papers (2025-10-09T17:57:38Z)
- DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion [50.90541069907167]
We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation. Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency.
arXiv Detail & Related papers (2025-06-26T17:58:26Z)
- 3D Scene Understanding Through Local Random Access Sequence Modeling [12.689247678229382]
3D scene understanding from single images is a pivotal problem in computer vision. We propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities.
arXiv Detail & Related papers (2025-04-04T18:59:41Z)
- LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images [23.258004561060563]
We introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. Our method achieves superior view-to-view consistency and semantic normality.
arXiv Detail & Related papers (2025-04-03T07:18:48Z)
- HORT: Monocular Hand-held Objects Reconstruction with Transformers [61.36376511119355]
Reconstructing hand-held objects in 3D from monocular images is a significant challenge in computer vision. We propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
arXiv Detail & Related papers (2025-03-27T09:45:09Z)
- FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction [69.63414788486578]
FreeSplatter is a scalable feed-forward framework that generates high-quality 3D Gaussians from uncalibrated sparse-view images. Our approach employs a streamlined transformer architecture where self-attention blocks facilitate information exchange. We develop two specialized variants, for object-centric and scene-level reconstruction, trained on comprehensive datasets.
arXiv Detail & Related papers (2024-12-12T18:52:53Z)
- ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance [76.7746870349809]
We present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling.
arXiv Detail & Related papers (2024-03-19T03:39:43Z)
- CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting [57.14748263512924]
CG3D is a method for compositionally generating scalable 3D assets. Gaussian radiance fields, parameterized to allow for compositions of objects, possess the capability to enable semantically and physically consistent scenes.
arXiv Detail & Related papers (2023-11-29T18:55:38Z)
- Variable Radiance Field for Real-World Category-Specific Reconstruction from Single Image [25.44715538841181]
Reconstructing category-specific objects using Neural Radiance Field (NeRF) from a single image is a promising yet challenging task. We propose Variable Radiance Field (VRF), a novel framework capable of efficiently reconstructing category-specific objects. VRF achieves state-of-the-art performance in both reconstruction quality and computational efficiency.
arXiv Detail & Related papers (2023-06-08T12:12:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.