DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization
- URL: http://arxiv.org/abs/2305.00393v4
- Date: Fri, 26 Jan 2024 07:24:31 GMT
- Title: DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization
- Authors: Yanpeng Zhao, Siyu Gao, Yunbo Wang, Xiaokang Yang
- Abstract summary: We present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning.
The key idea is to perform object-centric voxelization to capture the 3D nature of the scene.
Voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised learning of object-centric representations in dynamic visual
scenes is challenging. Unlike most previous approaches that learn to decompose
2D images, we present DynaVol, a 3D scene generative model that unifies
geometric structures and object-centric learning in a differentiable volume
rendering framework. The key idea is to perform object-centric voxelization to
capture the 3D nature of the scene, which infers the probability distribution
over objects at individual spatial locations. These voxel features evolve over
time through a canonical-space deformation function, forming the basis for
global representation learning via slot attention. The voxel features and
global features are complementary and are both leveraged by a compositional
NeRF decoder for volume rendering. DynaVol remarkably outperforms existing
approaches for unsupervised dynamic scene decomposition. Once trained, the
explicitly meaningful voxel features enable additional capabilities that 2D
scene decomposition methods cannot achieve: it is possible to freely edit the
geometric shapes or manipulate the motion trajectories of the objects.
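
The abstract describes three coupled mechanisms: a voxel grid holding a per-location probability distribution over objects, a canonical-space deformation function that warps points observed at time t back to a shared canonical frame, and a compositional NeRF decoder that mixes per-object densities and colors during volume rendering. The sketch below shows how these pieces could fit together; it is a minimal toy under stated assumptions, not the authors' implementation: the grid resolution, slot count, rigid stand-in deformation, and all names and constants are illustrative, and the slot-attention stage that produces global features is omitted.

```python
# Minimal sketch of object-centric voxelization + compositional volume
# rendering as described in the abstract. Everything here (shapes, names,
# constants, the rigid deformation) is an illustrative assumption.
import numpy as np

K = 4   # assumed number of object slots
G = 32  # assumed voxel grid resolution per axis

# Object-centric voxel grid: at every voxel, a distribution over which of
# the K objects (or background) occupies that location.
occupancy_logits = np.zeros((G, G, G, K))

def object_probs(idx):
    """Softmax over the K objects at one voxel."""
    logits = occupancy_logits[idx]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def deform_to_canonical(x, t):
    """Canonical-space deformation: warp a point observed at time t back to
    the canonical frame, so one shared voxel grid explains all timesteps.
    The paper learns this as a network; a rigid translation stands in here."""
    velocity = np.array([0.1, 0.0, 0.0])  # assumed, purely illustrative motion
    return x - velocity * t

def voxel_index(x):
    """Nearest-voxel lookup for a point in [-1, 1]^3 (no interpolation)."""
    i = np.clip(((x + 1.0) * 0.5 * G).astype(int), 0, G - 1)
    return tuple(i)

def composite_ray(origin, direction, t_query, depths, sigma_k, color_k):
    """Compositional volume rendering: per-object densities are mixed by the
    voxel's object probabilities into one density and one color, then
    alpha-composited front to back along the ray."""
    rgb = np.zeros(3)
    transmittance = 1.0
    deltas = np.diff(depths, append=depths[-1] + (depths[-1] - depths[-2]))
    for depth, delta in zip(depths, deltas):
        x = origin + depth * direction
        idx = voxel_index(deform_to_canonical(x, t_query))
        p = object_probs(idx)                               # (K,) object weights
        sigma = float(p @ sigma_k)                          # composed density
        color = (p * sigma_k) @ color_k / max(sigma, 1e-8)  # density-weighted mix
        alpha = 1.0 - np.exp(-sigma * delta)
        rgb += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return rgb

# Render one pixel at time t=0.5 through the (untrained, uniform) grid.
sigma_k = np.array([5.0, 2.0, 0.5, 0.0])  # assumed per-object densities
color_k = np.random.rand(K, 3)            # assumed per-object colors
pixel = composite_ray(np.array([0.0, 0.0, -1.5]), np.array([0.0, 0.0, 1.0]),
                      0.5, np.linspace(0.5, 2.5, 64), sigma_k, color_k)
print(pixel)
```

In the actual method, the deformation function and the per-object radiance are learned networks trained end to end through exactly this kind of differentiable compositing, which is what lets the per-voxel object probabilities emerge without supervision.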
Related papers
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
The voxelization infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- Variational Inference for Scalable 3D Object-centric Learning
We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes.
Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes.
We propose to learn view-invariant 3D object representations in localized object coordinate systems.
arXiv Detail & Related papers (2023-09-25T10:23:40Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Object Scene Representation Transformer
We introduce the Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes, with a larger diversity of objects and backgrounds, than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light-field parametrization and its novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)
- 3D Neural Scene Representations for Visuomotor Control
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors
PriSMONet is a novel approach for learning multi-object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of the 3D shape, pose, and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out the benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)