Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion
- URL: http://arxiv.org/abs/2110.02951v1
- Date: Wed, 6 Oct 2021 17:57:42 GMT
- Title: Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion
- Authors: Zihang Lai, Sifei Liu, Alexei A. Efros, Xiaolong Wang
- Abstract summary: A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
- Score: 60.58836145375273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A video autoencoder is proposed for learning disentangled representations
of 3D structure and camera pose from videos in a self-supervised manner.
Relying on temporal continuity in videos, our work assumes that the 3D scene
structure in nearby video frames remains static. Given a sequence of video
frames as input, the video autoencoder extracts a disentangled representation
of the scene including: (i) a temporally-consistent deep voxel feature to
represent the 3D structure and (ii) a 3D trajectory of camera pose for each
frame. These two representations will then be re-entangled for rendering the
input video frames. This video autoencoder can be trained directly using a
pixel reconstruction loss, without any ground truth 3D or camera pose
annotations. The disentangled representation can be applied to a range of
tasks, including novel view synthesis, camera pose estimation, and video
generation by motion following. We evaluate our method on several large-scale
natural video datasets, and show generalization results on out-of-domain
images.
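The re-entangling step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes an axis-angle plus translation pose, a cubic voxel grid with nearest-neighbour resampling, and a mean projection along depth as a stand-in for the learned decoder. The names `rotation_from_axis_angle`, `warp_voxels`, `render` and `l1_loss` are hypothetical.

```python
import math

def rotation_from_axis_angle(rvec):
    """Rodrigues' formula: axis-angle 3-vector -> 3x3 rotation matrix."""
    theta = math.sqrt(sum(r * r for r in rvec))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (r / theta for r in rvec)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def warp_voxels(grid, R, t):
    """Resample a cubic grid[z][y][x] under the rigid motion p = R q + t."""
    n = len(grid)
    centre = (n - 1) / 2.0
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for z in range(n):
        for y in range(n):
            for x in range(n):
                # Target voxel position relative to the grid centre, minus t.
                d = (x - centre - t[0], y - centre - t[1], z - centre - t[2])
                # Inverse rotation gives the source position: q = R^T (p - t).
                q = [sum(R[i][j] * d[i] for i in range(3)) for j in range(3)]
                sx, sy, sz = (int(round(c + centre)) for c in q)
                if 0 <= sx < n and 0 <= sy < n and 0 <= sz < n:
                    out[z][y][x] = grid[sz][sy][sx]
    return out

def render(grid):
    """Toy decoder (assumption): average the grid along depth into an image."""
    n = len(grid)
    return [[sum(grid[z][y][x] for z in range(n)) / n for x in range(n)]
            for y in range(n)]

def l1_loss(img_a, img_b):
    """Mean absolute pixel difference between two equally sized images."""
    diffs = [abs(a - b) for ra, rb in zip(img_a, img_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# One occupied voxel on the +x side of the centre, viewed after a 90 degree
# rotation about the z axis.
grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][2] = 1.0
pose_R = rotation_from_axis_angle([0.0, 0.0, math.pi / 2])
warped = warp_voxels(grid, pose_R, [0.0, 0.0, 0.0])
loss = l1_loss(render(warped), render(grid))
```

In this sketch the grid plays the role of the static structure and the pose plays the role of the per-frame camera motion. A trained model would predict both from the frames and minimise the reconstruction loss against the real input frames rather than against a second rendering.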
Related papers
- Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z)
- Tracking by 3D Model Estimation of Unknown Objects in Videos [122.56499878291916]
We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation.
Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames.
The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose.
arXiv Detail & Related papers (2023-04-13T11:32:36Z)
- NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z)
- Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure [42.3091008598491]
We develop a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects.
It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views.
Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame.
arXiv Detail & Related papers (2021-06-16T18:00:12Z)
- Online Adaptation for Consistent Mesh Reconstruction in the Wild [147.22708151409765]
We pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video.
We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild.
arXiv Detail & Related papers (2020-12-06T07:22:27Z)