Related papers: CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration

URL: http://arxiv.org/abs/2512.00493v1
Date: Sat, 29 Nov 2025 14:01:13 GMT
Title: CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration
Authors: Boshi Tang, Henry Zheng, Rui Huang, Gao Huang
Abstract summary: High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. This paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation.
Score: 29.052223430061826
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. Early approaches struggle to generalize due to reliance on specialized models trained on curated small datasets. While recent advancements in large-scale 3D foundation models have significantly enhanced instance-level generation, coherent scene generation remains a challenge, where performance is limited by inaccurate per-object pose estimations and spatial inconsistency. To this end, this paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation that jointly conforms to the object layout in the input image and preserves instance fidelity. CC-FMO employs a hybrid instance generator that combines a semantics-aware vector-set representation with a detail-rich structured latent representation, yielding object geometries that are both semantically plausible and high-quality. Furthermore, CC-FMO enables the application of foundational pose estimation models in the scene generation task via a simple yet effective camera-conditioned scale-solving algorithm that enforces scene-level coherence. Extensive experiments demonstrate that CC-FMO consistently generates high-fidelity camera-aligned compositional scenes, outperforming all state-of-the-art methods.

Related papers

FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation [50.71369329585773]
We introduce FACE, a novel Autoregressive Autoencoder framework that generates meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. FACE achieves state-of-the-art reconstruction quality on standard benchmarks.
arXiv Detail & Related papers (2026-03-02T06:47:15Z)
TRELLISWorld: Training-Free World Generation from Object Generators [13.962895984556582]
Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. Existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. We present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators.
arXiv Detail & Related papers (2025-10-27T21:40:31Z)
ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation [44.75113949778924]
ARTDECO is a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. We show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization.
arXiv Detail & Related papers (2025-10-09T17:57:38Z)
DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion [50.90541069907167]
We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation. Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency.
arXiv Detail & Related papers (2025-06-26T17:58:26Z)
3D Scene Understanding Through Local Random Access Sequence Modeling [12.689247678229382]
3D scene understanding from single images is a pivotal problem in computer vision. We propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities.
arXiv Detail & Related papers (2025-04-04T18:59:41Z)
LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images [23.258004561060563]
We introduce LPA-GAN, a novel NeRF-based generative approach that incorporates specific modifications to estimate the priors of camera poses under LPA. Our method achieves superior view-to-view consistency and semantic normality.
arXiv Detail & Related papers (2025-04-03T07:18:48Z)
HORT: Monocular Hand-held Objects Reconstruction with Transformers [61.36376511119355]
Reconstructing hand-held objects in 3D from monocular images is a significant challenge in computer vision. We propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
arXiv Detail & Related papers (2025-03-27T09:45:09Z)
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction [69.63414788486578]
FreeSplatter is a scalable feed-forward framework that generates high-quality 3D Gaussians from uncalibrated sparse-view images. Our approach employs a streamlined transformer architecture where self-attention blocks facilitate information exchange. We develop two specialized variants, for object-centric and scene-level reconstruction, trained on comprehensive datasets.
arXiv Detail & Related papers (2024-12-12T18:52:53Z)
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance [76.7746870349809]
We present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling.
arXiv Detail & Related papers (2024-03-19T03:39:43Z)
CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting [57.14748263512924]
CG3D is a method for compositionally generating scalable 3D assets. Gaussian radiance fields, parameterized to allow for compositions of objects, possess the capability to enable semantically and physically consistent scenes.
arXiv Detail & Related papers (2023-11-29T18:55:38Z)
Variable Radiance Field for Real-World Category-Specific Reconstruction from Single Image [25.44715538841181]
Reconstructing category-specific objects using Neural Radiance Field (NeRF) from a single image is a promising yet challenging task. We propose Variable Radiance Field (VRF), a novel framework capable of efficiently reconstructing category-specific objects. VRF achieves state-of-the-art performance in both reconstruction quality and computational efficiency.
arXiv Detail & Related papers (2023-06-08T12:12:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.