SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
- URL: http://arxiv.org/abs/2512.10957v1
- Date: Thu, 11 Dec 2025 18:59:56 GMT
- Title: SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
- Authors: Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang,
- Abstract summary: We propose a decoupled 3D scene generation framework called SceneMaker in this work.<n>We first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets.<n>Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy.
- Score: 83.37403036061403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our codes and datasets is released at https://idea-research.github.io/SceneMaker/.
Related papers
- SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation [32.15143378003745]
occlusion reasoning is essential for partially occluded objects with depth-consistent geometry and scale.<n>We propose SeeThrough3D, a model for 3D layout conditioned generation that explicitly models occlusions.
arXiv Detail & Related papers (2026-02-26T18:59:05Z) - One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation [32.45730375971019]
Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances.<n>We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components.<n>We demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation.
arXiv Detail & Related papers (2025-09-09T17:59:02Z) - DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion [50.90541069907167]
We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation.<n>Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency.
arXiv Detail & Related papers (2025-06-26T17:58:26Z) - CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design [35.11283253765395]
We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene.<n>Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes.
arXiv Detail & Related papers (2025-04-28T04:35:04Z) - SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views [36.02533658048349]
We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for sparse-view images.
SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views.
It requires only about 20 seconds to produce a textured mesh and camera poses for the input views.
arXiv Detail & Related papers (2024-08-19T17:53:10Z) - Mixed Diffusion for 3D Indoor Scene Synthesis [55.94569112629208]
We present MiDiffusion, a novel mixed discrete-continuous diffusion model designed to synthesize plausible 3D indoor scenes.<n>We show it outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis.
arXiv Detail & Related papers (2024-05-31T17:54:52Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - Occ$^2$Net: Robust Image Matching Based on 3D Occupancy Estimation for
Occluded Regions [14.217367037250296]
Occ$2$Net is an image matching method that models occlusion relations using 3D occupancy and infers matching points in occluded regions.
We evaluate our method on both real-world and simulated datasets and demonstrate its superior performance over state-of-the-art methods on several metrics.
arXiv Detail & Related papers (2023-08-14T13:09:41Z) - Visual Localization using Imperfect 3D Models from the Internet [54.731309449883284]
This paper studies how imperfections in 3D models affect localization accuracy.
We show that 3D models from the Internet show promise as an easy-to-obtain scene representation.
arXiv Detail & Related papers (2023-04-12T16:15:05Z) - GAUDI: A Neural Architect for Immersive 3D Scene Generation [67.97817314857917]
GAUDI is a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera.
We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets.
arXiv Detail & Related papers (2022-07-27T19:10:32Z) - MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z) - CAMPARI: Camera-Aware Decomposed Generative Neural Radiance Fields [67.76151996543588]
We learn a 3D- and camera-aware generative model which faithfully recovers not only the image but also the camera data distribution.
At test time, our model generates images with explicit control over the camera as well as the shape and appearance of the scene.
arXiv Detail & Related papers (2021-03-31T17:59:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.