Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion
- URL: http://arxiv.org/abs/2110.02951v1
- Date: Wed, 6 Oct 2021 17:57:42 GMT
- Title: Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion
- Authors: Zihang Lai, Sifei Liu, Alexei A. Efros, Xiaolong Wang
- Abstract summary: A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
- Score: 60.58836145375273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A video autoencoder is proposed for learning disentangled representations
of 3D structure and camera pose from videos in a self-supervised manner.
Relying on temporal continuity in videos, our work assumes that the 3D scene
structure in nearby video frames remains static. Given a sequence of video
frames as input, the video autoencoder extracts a disentangled representation
of the scene including: (i) a temporally-consistent deep voxel feature to
represent the 3D structure and (ii) a 3D trajectory of camera pose for each
frame. These two representations will then be re-entangled for rendering the
input video frames. This video autoencoder can be trained directly using a
pixel reconstruction loss, without any ground truth 3D or camera pose
annotations. The disentangled representation can be applied to a range of
tasks, including novel view synthesis, camera pose estimation, and video
generation by motion following. We evaluate our method on several large-scale
natural video datasets, and show generalization results on out-of-domain
images.
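
As a concrete illustration of the pipeline the abstract describes (a static deep voxel feature plus a per-frame camera pose, re-entangled for rendering and trained with only a pixel reconstruction loss), here is a minimal PyTorch sketch. All module shapes, the 6-DoF pose parameterization, and the toy render step are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    def __init__(self, vox_ch=16, vox_depth=8):
        super().__init__()
        self.vox_ch, self.vox_depth = vox_ch, vox_depth
        # (i) Structure branch: reference frame -> deep voxel feature,
        # assumed static across the short clip (temporal continuity).
        self.structure_enc = nn.Conv2d(3, vox_ch * vox_depth, 3, stride=2, padding=1)
        # (ii) Pose branch: (reference, frame) pair -> 6-DoF pose vector
        # (3 rotation + 3 translation parameters).
        self.pose_enc = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6))
        # Decoder used by the placeholder renderer below.
        self.dec = nn.ConvTranspose2d(vox_ch * vox_depth, 3, 4, stride=2, padding=1)

    def render(self, voxels, pose):
        # Placeholder re-entanglement: a real model would warp the voxel
        # grid by the camera pose and project it to the image plane; here
        # we only modulate the features with the pose code and decode.
        b, c, d, h, w = voxels.shape
        gain = 1.0 + 0.1 * torch.tanh(pose).mean(dim=1).view(b, 1, 1, 1)
        return torch.sigmoid(self.dec(voxels.reshape(b, c * d, h, w) * gain))

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        ref = frames[:, 0]                         # reference frame
        feat = self.structure_enc(ref)
        voxels = feat.view(feat.shape[0], self.vox_ch, self.vox_depth,
                           feat.shape[-2], feat.shape[-1])
        poses = [self.pose_enc(torch.cat([ref, frames[:, t]], dim=1))
                 for t in range(frames.shape[1])]
        recons = torch.stack([self.render(voxels, p) for p in poses], dim=1)
        return recons, torch.stack(poses, dim=1)

# Self-supervised training: pixel reconstruction is the only loss; no 3D
# or camera-pose ground truth is required.
model = VideoAutoencoder()
clip = torch.rand(2, 4, 3, 64, 64)                 # (B, T, C, H, W)
recons, poses = model(clip)
loss = nn.functional.l1_loss(recons, clip)
loss.backward()
```

Once trained, the two disentangled outputs support the downstream tasks listed above: `poses` gives a per-frame camera trajectory (pose estimation), and rendering the same static voxels under new or borrowed pose trajectories yields novel views and motion-following video generation.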
Related papers
- CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation [76.72787726497343]
We present CineMaster, a framework for 3D-aware and controllable text-to-video generation.
Our goal is to empower users with controllability comparable to that of professional film directors.
arXiv Detail & Related papers (2025-02-12T18:55:36Z)
- VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment [62.6737516863285]
VideoLifter is a novel framework that incrementally optimizes a globally sparse-to-dense 3D representation directly from video sequences.
By tracking and propagating sparse point correspondences across frames and fragments, VideoLifter incrementally refines camera poses and 3D structure.
This approach significantly accelerates the reconstruction process, reducing training time by over 82% while surpassing current state-of-the-art methods in visual fidelity and computational efficiency.
arXiv Detail & Related papers (2025-01-03T18:52:36Z)
- Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z)
- Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure [42.3091008598491]
We develop a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects.
It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views.
Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame.
arXiv Detail & Related papers (2021-06-16T18:00:12Z)
- Online Adaptation for Consistent Mesh Reconstruction in the Wild [147.22708151409765]
We pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video.
We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild.
arXiv Detail & Related papers (2020-12-06T07:22:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.