Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion
- URL: http://arxiv.org/abs/2110.02951v1
- Date: Wed, 6 Oct 2021 17:57:42 GMT
- Title: Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion
- Authors: Zihang Lai, Sifei Liu, Alexei A. Efros, Xiaolong Wang
- Abstract summary: A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
- Score: 60.58836145375273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A video autoencoder is proposed for learning disentangled representations
of 3D structure and camera pose from videos in a self-supervised manner.
Relying on temporal continuity in videos, our work assumes that the 3D scene
structure in nearby video frames remains static. Given a sequence of video
frames as input, the video autoencoder extracts a disentangled representation
of the scene including: (i) a temporally-consistent deep voxel feature to
represent the 3D structure and (ii) a 3D trajectory of camera pose for each
frame. These two representations will then be re-entangled for rendering the
input video frames. This video autoencoder can be trained directly using a
pixel reconstruction loss, without any ground truth 3D or camera pose
annotations. The disentangled representation can be applied to a range of
tasks, including novel view synthesis, camera pose estimation, and video
generation by motion following. We evaluate our method on several large-scale
natural video datasets, and show generalization results on out-of-domain
images.
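
As a concrete illustration of the pipeline the abstract describes (a static deep voxel feature plus a per-frame camera pose, re-entangled for rendering and trained with only a pixel reconstruction loss), here is a minimal PyTorch sketch. All module shapes, the 6-DoF pose parameterization, and the toy render step are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    def __init__(self, vox_ch=16, vox_depth=8):
        super().__init__()
        self.vox_ch, self.vox_depth = vox_ch, vox_depth
        # (i) Structure branch: reference frame -> deep voxel feature,
        # assumed static across the short clip (temporal continuity).
        self.structure_enc = nn.Conv2d(3, vox_ch * vox_depth, 3, stride=2, padding=1)
        # (ii) Pose branch: (reference, frame) pair -> 6-DoF pose vector
        # (3 rotation + 3 translation parameters).
        self.pose_enc = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6))
        # Decoder used by the placeholder renderer below.
        self.dec = nn.ConvTranspose2d(vox_ch * vox_depth, 3, 4, stride=2, padding=1)

    def render(self, voxels, pose):
        # Placeholder re-entanglement: a real model would warp the voxel
        # grid by the camera pose and project it to the image plane; here
        # we only modulate the features with the pose code and decode.
        b, c, d, h, w = voxels.shape
        gain = 1.0 + 0.1 * torch.tanh(pose).mean(dim=1).view(b, 1, 1, 1)
        return torch.sigmoid(self.dec(voxels.reshape(b, c * d, h, w) * gain))

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        ref = frames[:, 0]                         # reference frame
        feat = self.structure_enc(ref)
        voxels = feat.view(feat.shape[0], self.vox_ch, self.vox_depth,
                           feat.shape[-2], feat.shape[-1])
        poses = [self.pose_enc(torch.cat([ref, frames[:, t]], dim=1))
                 for t in range(frames.shape[1])]
        recons = torch.stack([self.render(voxels, p) for p in poses], dim=1)
        return recons, torch.stack(poses, dim=1)

# Self-supervised training: pixel reconstruction is the only loss; no 3D
# or camera-pose ground truth is required.
model = VideoAutoencoder()
clip = torch.rand(2, 4, 3, 64, 64)                 # (B, T, C, H, W)
recons, poses = model(clip)
loss = nn.functional.l1_loss(recons, clip)
loss.backward()
```

Once trained, the two disentangled outputs support the downstream tasks listed above: `poses` gives a per-frame camera trajectory (pose estimation), and rendering the same static voxels under new or borrowed pose trajectories yields novel views and motion-following video generation.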
Related papers
- CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation [76.72787726497343]
We present CineMaster, a framework for 3D-aware and controllable text-to-video generation.
Our goal is to empower users with controllability comparable to that of professional film directors.
arXiv Detail & Related papers (2025-02-12T18:55:36Z)
- VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment [62.6737516863285]
VideoLifter is a novel framework that incrementally optimizes a globally sparse-to-dense 3D representation directly from video sequences.
By tracking and propagating sparse point correspondences across frames and fragments, VideoLifter incrementally refines camera poses and 3D structure.
This approach significantly accelerates the reconstruction process, reducing training time by over 82% while surpassing current state-of-the-art methods in visual fidelity and computational efficiency.
arXiv Detail & Related papers (2025-01-03T18:52:36Z)
- Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z)
- Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure [42.3091008598491]
We develop a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects.
It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views.
Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame.
arXiv Detail & Related papers (2021-06-16T18:00:12Z)
- Online Adaptation for Consistent Mesh Reconstruction in the Wild [147.22708151409765]
We pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video.
We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild.
arXiv Detail & Related papers (2020-12-06T07:22:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.