Mixed Diffusion for 3D Indoor Scene Synthesis
- URL: http://arxiv.org/abs/2405.21066v2
- Date: Mon, 09 Dec 2024 22:33:30 GMT
- Title: Mixed Diffusion for 3D Indoor Scene Synthesis
- Authors: Siyi Hu, Diego Martin Arroyo, Stephanie Debats, Fabian Manhardt, Luca Carlone, Federico Tombari
- Abstract summary: We present MiDiffusion, a novel mixed discrete-continuous diffusion model designed to synthesize plausible 3D indoor scenes.
We show it outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis.
- Score: 55.94569112629208
- Abstract: Generating realistic 3D scenes is an area of growing interest in computer vision and robotics. However, creating high-quality, diverse synthetic 3D content often requires expert intervention, making it costly and complex. Recently, efforts to automate this process with learning techniques, particularly diffusion models, have shown significant improvements in tasks like furniture rearrangement. However, applying diffusion models to floor-conditioned indoor scene synthesis remains under-explored. This task is especially challenging as it requires arranging objects in continuous space while selecting from discrete object categories, posing unique difficulties for conventional diffusion methods. To bridge this gap, we present MiDiffusion, a novel mixed discrete-continuous diffusion model designed to synthesize plausible 3D indoor scenes given a floor plan and pre-arranged objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by category, location, size, and orientation. Our approach uniquely applies structured corruption across mixed discrete semantic and continuous geometric domains, resulting in a better-conditioned problem for denoising. Evaluated on the 3D-FRONT dataset, MiDiffusion outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. Additionally, it effectively handles partial object constraints via a corruption-and-masking strategy without task-specific training, demonstrating advantages in scene completion and furniture arrangement tasks.
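The core mechanism described here lends itself to a short sketch. Below is a minimal, hedged illustration of what a mixed discrete-continuous forward corruption with masking might look like; the attribute layout, noise schedules, and function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CATEGORIES = 23        # assumed number of furniture classes
GEOM_DIM = 3 + 3 + 2       # location (x, y, z) + size (w, h, d) + orientation (cos, sin)
T = 1000                   # number of diffusion steps (assumed)

# Assumed linear schedules; the paper may use different ones.
betas = np.linspace(1e-4, 2e-2, T)             # Gaussian noise schedule (continuous)
alphas_bar = np.cumprod(1.0 - betas)
gammas = np.linspace(1e-4, 2e-2, T)            # label-corruption schedule (discrete)
gammas_bar = 1.0 - np.cumprod(1.0 - gammas)    # prob. a label was resampled by step t

def corrupt(categories, geometry, t, keep_mask=None):
    """Corrupt a scene layout to diffusion step t.

    categories: (N,) int labels; geometry: (N, GEOM_DIM) floats.
    keep_mask: optional (N,) bool; True entries stay clean, mimicking the
    corruption-and-masking handling of pre-arranged objects.
    """
    n = categories.shape[0]
    # Continuous branch: DDPM-style Gaussian corruption of geometric attributes.
    noise = rng.standard_normal(geometry.shape)
    geom_t = np.sqrt(alphas_bar[t]) * geometry + np.sqrt(1.0 - alphas_bar[t]) * noise
    # Discrete branch: resample each semantic label uniformly with prob. gammas_bar[t].
    resampled = rng.random(n) < gammas_bar[t]
    cat_t = np.where(resampled, rng.integers(0, NUM_CATEGORIES, n), categories)
    if keep_mask is not None:  # clamp constrained objects to their clean values
        geom_t[keep_mask] = geometry[keep_mask]
        cat_t[keep_mask] = categories[keep_mask]
    return cat_t, geom_t

# Toy scene: five objects, the first held fixed as a partial constraint.
cats = rng.integers(0, NUM_CATEGORIES, 5)
geom = rng.standard_normal((5, GEOM_DIM))
keep = np.array([True, False, False, False, False])
cat_t, geom_t = corrupt(cats, geom, t=500, keep_mask=keep)
```

A denoising network would be trained to invert this corruption jointly over both domains; the masking line is what lets the same model handle scene completion and rearrangement without task-specific training.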
Related papers
- 3D Spatial Understanding in MLLMs: Disambiguation and Evaluation [13.614206918726314]
We propose techniques to enhance the model's ability to localize and disambiguate target objects.
Our approach achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity.
arXiv Detail & Related papers (2024-12-09T16:04:32Z)
- DeBaRA: Denoising-Based 3D Room Arrangement Generation [22.96293773013579]
We introduce DeBaRA, a score-based model specifically tailored for precise, controllable and flexible arrangement generation in a bounded environment.
We demonstrate that by focusing on spatial attributes of objects, a single trained DeBaRA model can be leveraged at test time to perform several downstream applications such as scene synthesis, completion and re-arrangement.
arXiv Detail & Related papers (2024-09-26T23:18:25Z)
- Object-level Scene Deocclusion [92.39886029550286]
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin.
arXiv Detail & Related papers (2024-06-11T20:34:10Z)
- $\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation [17.281031933210762]
We introduce the Discrete Diffusion Pose ($\text{Di}^2\text{Pose}$), a novel framework designed for occluded 3D human pose estimation.
$\text{Di}^2\text{Pose}$ employs a two-stage process: it first converts 3D poses into a discrete representation through a pose quantization step.
This confines the search space to physically viable configurations.
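Pose quantization here can be pictured as nearest-neighbour vector quantization; the sketch below is a minimal illustration under that assumption, with the codebook size, per-joint quantization, and names invented for the example.

```python
# Minimal sketch of pose quantization as nearest-neighbour vector quantization.
# Codebook size, pose layout, and per-joint quantization are assumptions.
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, NUM_JOINTS = 512, 17
codebook = rng.standard_normal((CODEBOOK_SIZE, 3))  # learned in practice

def quantize(pose):
    """Map a (NUM_JOINTS, 3) pose to one discrete token per joint."""
    # Squared distance from every joint to every code vector.
    dists = ((pose[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (NUM_JOINTS,) codebook indices

tokens = quantize(rng.standard_normal((NUM_JOINTS, 3)))
# A discrete diffusion model then denoises these tokens, keeping the search
# within configurations the codebook can represent.
```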
arXiv Detail & Related papers (2024-05-27T10:01:36Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation [66.3814684757376]
This work presents Zero123-6D, the first work to demonstrate the utility of diffusion-model-based novel view synthesizers for enhancing RGB category-level 6D pose estimation.
The method reduces data requirements, removes the need for depth information in zero-shot category-level 6D pose estimation, and improves performance, as demonstrated quantitatively on the CO3D dataset.
arXiv Detail & Related papers (2024-03-21T10:38:18Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis [44.521452102413534]
We present DiffuScene for indoor 3D scene synthesis based on a novel scene configuration denoising diffusion model.
It generates 3D instance properties stored in an unordered object set and retrieves the most similar geometry for each object configuration.
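The retrieval step can be pictured as nearest-neighbour search in a shape-feature space; the snippet below is a hedged sketch under that assumption, with the feature dimension and database invented for illustration.

```python
# Hedged sketch of geometry retrieval: given a generated object configuration,
# fetch the most similar database model by cosine similarity of (assumed)
# shape-feature embeddings. The feature dimension and database are made up.
import numpy as np

rng = np.random.default_rng(0)
db_feats = rng.standard_normal((1000, 32))  # embeddings of 1000 CAD models
db_feats /= np.linalg.norm(db_feats, axis=1, keepdims=True)

def retrieve(obj_feat):
    """Return the index of the database model closest to obj_feat."""
    obj_feat = obj_feat / np.linalg.norm(obj_feat)
    return int((db_feats @ obj_feat).argmax())

best = retrieve(rng.standard_normal(32))  # index of the retrieved mesh
```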
arXiv Detail & Related papers (2023-03-24T18:00:15Z)
- ATISS: Autoregressive Transformers for Indoor Scene Synthesis [112.63708524926689]
We present ATISS, a novel autoregressive transformer architecture for creating synthetic indoor environments.
We argue that generating rooms as unordered sets of objects, rather than ordered sequences, is a more natural formulation, as it makes ATISS generally useful beyond fully automatic room layout synthesis.
Our model is trained end-to-end as an autoregressive generative model using only labeled 3D bounding boxes as supervision.
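An autoregressive set generator of this kind can be sketched as a loop that repeatedly samples the next object conditioned on those already placed; the stub below shows the control flow only, with the model call, attribute layout, and stopping rule assumed for the example.

```python
# Control-flow sketch of autoregressive object-set generation.
# next_object() stands in for a learned transformer head; the attribute
# layout and the stopping rule are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
MAX_OBJECTS, ATTR_DIM = 12, 9  # category + bounding-box attributes (assumed)

def next_object(placed):
    """Dummy stand-in for the learned model: propose one object or stop."""
    if len(placed) >= MAX_OBJECTS or rng.random() < 0.15:
        return None  # plays the role of an "end of scene" token
    return rng.standard_normal(ATTR_DIM)

scene = []
while (obj := next_object(scene)) is not None:
    scene.append(obj)  # condition the next step on all objects placed so far
print(f"generated {len(scene)} objects")
```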
arXiv Detail & Related papers (2021-10-07T17:58:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.