Diffusion-based Generation, Optimization, and Planning in 3D Scenes
- URL: http://arxiv.org/abs/2301.06015v1
- Date: Sun, 15 Jan 2023 03:43:45 GMT
- Title: Diffusion-based Generation, Optimization, and Planning in 3D Scenes
- Authors: Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu,
Wei Liang, Song-Chun Zhu
- Abstract summary: We introduce SceneDiffuser, a conditional generative model for 3D scene understanding.
SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented.
We show significant improvements compared with previous models.
- Score: 89.63179422011254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SceneDiffuser, a conditional generative model for 3D scene
understanding. SceneDiffuser provides a unified model for solving
scene-conditioned generation, optimization, and planning. In contrast to prior
works, SceneDiffuser is intrinsically scene-aware, physics-based, and
goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly
formulates the scene-aware generation, physics-based optimization, and
goal-oriented planning via a diffusion-based denoising process in a fully
differentiable fashion. Such a design alleviates the discrepancies among
different modules and the posterior collapse of previous scene-conditioned
generative models. We evaluate SceneDiffuser with various 3D scene
understanding tasks, including human pose and motion generation, dexterous
grasp generation, path planning for 3D navigation, and motion planning for
robot arms. The results show significant improvements compared with previous
models, demonstrating the tremendous potential of SceneDiffuser for the broad
community of 3D scene understanding.
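The "iterative sampling strategy" above can be read as a standard reverse-diffusion loop in which the physics and goal terms enter as gradient guidance at every denoising step. Below is a minimal sketch of that pattern; it is not the authors' implementation, and `eps_model`, `scene_cond`, and `objective` are hypothetical placeholders.
```python
# A minimal sketch (not the authors' code) of diffusion sampling with gradient
# guidance, illustrating how generation, optimization, and planning can share
# one denoising loop.
import torch

@torch.no_grad()
def guided_sample(eps_model, scene_cond, objective, shape,
                  alphas_cumprod, guidance_scale=1.0):
    """Reverse diffusion steered by a differentiable task objective.

    eps_model(x_t, t, scene_cond) -> predicted noise at step t
    objective(x0, scene_cond)     -> scalar cost (e.g., a physics penalty or a
                                     distance-to-goal term); lower is better
    """
    T = len(alphas_cumprod)
    x = torch.randn(shape)                                   # start from pure noise
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)

        eps = eps_model(x, t, scene_cond)                    # scene-conditioned denoiser
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean sample

        # Optimization/planning as guidance: push the current estimate down
        # the gradient of the task objective before the next denoising step.
        with torch.enable_grad():
            x0_req = x0_hat.detach().requires_grad_(True)
            grad = torch.autograd.grad(objective(x0_req, scene_cond), x0_req)[0]
        x0_hat = x0_hat - guidance_scale * grad

        # Deterministic (DDIM-style) update toward the guided estimate
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return x
```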
Related papers
- CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design [35.11283253765395]
We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene.
Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes.
arXiv Detail & Related papers (2025-04-28T04:35:04Z)
- 3D Scene Understanding Through Local Random Access Sequence Modeling [12.689247678229382]
3D scene understanding from single images is a pivotal problem in computer vision.
We propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling.
Using optical flow as an intermediate representation for 3D scene editing, LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation in our experiments.
arXiv Detail & Related papers (2025-04-04T18:59:41Z)
- Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors [52.63385546943866]
We present a text-to-scene generation method, Layout2Scene, that uses an additional semantic layout as a prompt to inject precise control over 3D object positions.
To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model.
Our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-05T12:20:13Z)
- Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner.
We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z)
- Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation [54.60804602905519]
Prior methods learn an entangled representation, aiming to model layered scene geometry, motion forecasting, and novel view synthesis together.
In contrast, our approach disentangles scene geometry from scene motion by lifting the 2D scene to 3D point clouds.
To model future 3D scene motion, we propose a disentangled two-stage approach that initially forecasts ego-motion and subsequently the residual motion of dynamic objects.
arXiv Detail & Related papers (2024-07-31T08:54:50Z)
- Mixed Diffusion for 3D Indoor Scene Synthesis [55.94569112629208]
We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture.
We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation.
Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis.
arXiv Detail & Related papers (2024-05-31T17:54:52Z)
- 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new content with higher quality by exploiting the natural image prior of the 2D diffusion model and the global 3D information of the current scene.
Our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- DORSal: Diffusion for Object-centric Representations of Scenes et al [28.181157214966493]
Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes.
We propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes.
arXiv Detail & Related papers (2023-06-13T18:32:35Z)
- GAUDI: A Neural Architect for Immersive 3D Scene Generation [67.97817314857917]
GAUDI is a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera.
We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets.
arXiv Detail & Related papers (2022-07-27T19:10:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.