WALDO: Future Video Synthesis using Object Layer Decomposition and
Parametric Flow Prediction
- URL: http://arxiv.org/abs/2211.14308v3
- Date: Tue, 29 Aug 2023 07:58:49 GMT
- Title: WALDO: Future Video Synthesis using Object Layer Decomposition and
Parametric Flow Prediction
- Authors: Guillaume Le Moing and Jean Ponce and Cordelia Schmid
- Abstract summary: WALDO is a novel approach to the prediction of future video frames from past ones.
Individual images are decomposed into multiple layers combining object masks and a small set of control points.
The layer structure is shared across all frames in each video to build dense inter-frame connections.
- Score: 82.79642869586587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents WALDO (WArping Layer-Decomposed Objects), a novel
approach to the prediction of future video frames from past ones. Individual
images are decomposed into multiple layers combining object masks and a small
set of control points. The layer structure is shared across all frames in each
video to build dense inter-frame connections. Complex scene motions are modeled
by combining parametric geometric transformations associated with individual
layers, and video synthesis is broken down into discovering the layers
associated with past frames, predicting the corresponding transformations for
upcoming ones and warping the associated object regions accordingly, and
filling in the remaining image parts. Extensive experiments on multiple
benchmarks including urban videos (Cityscapes and KITTI) and videos featuring
nonrigid motions (UCF-Sports and H3.6M) show that our method consistently
outperforms the state of the art by a significant margin in every case. Code,
pretrained models, and video samples synthesized by our approach can be found
on the project webpage: https://16lemoing.github.io/waldo.
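To make the pipeline concrete, here is a minimal sketch of the warp-and-composite step in PyTorch. It assumes affine per-layer transformations and reduces inpainting to a flagged hole mask; the function names, tensor shapes, and compositing order are illustrative assumptions, not the authors' released code (see the project webpage for that).
```python
import torch
import torch.nn.functional as F

def warp_layer(image, mask, theta):
    """Warp one RGB layer and its soft mask by a 2x3 affine matrix theta."""
    b, _, h, w = image.shape
    grid = F.affine_grid(theta, size=(b, 3, h, w), align_corners=False)
    warped_rgb = F.grid_sample(image * mask, grid, align_corners=False)
    warped_mask = F.grid_sample(mask, grid, align_corners=False)
    return warped_rgb, warped_mask

def synthesize_frame(past_frame, layer_masks, layer_thetas):
    """Composite warped layers back-to-front; return the frame and its holes."""
    canvas = torch.zeros_like(past_frame)
    coverage = torch.zeros_like(layer_masks[0])
    for mask, theta in zip(layer_masks, layer_thetas):
        rgb, m = warp_layer(past_frame, mask, theta)
        canvas = canvas * (1 - m) + rgb  # "over" compositing (rgb is premultiplied by the mask)
        coverage = torch.clamp(coverage + m, 0.0, 1.0)
    holes = 1 - coverage                 # regions left for an inpainting module
    return canvas, holes

# Toy usage: one 64x64 frame, two half-transparent layers.
frame = torch.rand(1, 3, 64, 64)
masks = [0.5 * torch.ones(1, 1, 64, 64), 0.5 * torch.ones(1, 1, 64, 64)]
identity = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
shifted = torch.tensor([[[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]])
future, holes = synthesize_frame(frame, masks, [identity, shifted])
```
In the actual method, the per-layer transformations are derived from a small set of predicted control points; the hand-set affine matrices above stand in for those predictions.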
Related papers
- Explorative Inbetweening of Time and Space [46.77750028273578]
We introduce bounded generation to control video generation based only on a given start and end frame.
Time Reversal Fusion fuses the temporally forward and backward denoising paths conditioned on the start and end frame.
We find that Time Reversal Fusion outperforms related work on all subtasks.
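The fusion idea can be caricatured at the frame level (the actual method fuses denoising trajectories inside the diffusion sampler). A toy Python sketch, where the linear endpoint weighting is our assumption rather than the paper's fusion rule:
```python
import numpy as np

def time_reversal_fuse(forward_frames, backward_frames):
    """forward/backward_frames: (T, H, W, C) arrays from the two paths."""
    t = forward_frames.shape[0]
    # Reverse the backward path so both run start -> end in time.
    backward = backward_frames[::-1]
    # Weight each frame toward whichever endpoint conditioned its path.
    w = np.linspace(1.0, 0.0, t).reshape(t, 1, 1, 1)
    return w * forward_frames + (1.0 - w) * backward

fwd = np.random.rand(8, 32, 32, 3)  # path conditioned on the start frame
bwd = np.random.rand(8, 32, 32, 3)  # path conditioned on the end frame
fused = time_reversal_fuse(fwd, bwd)
```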
arXiv Detail & Related papers (2024-03-21T17:57:31Z)
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
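A hedged sketch of such a two-stage pipeline, assuming the first stage produces sparse keyframes and the second interpolates between them; both stage functions below are stand-in stubs, not the FusionFrames models:
```python
import numpy as np

def generate_keyframes(prompt, num_keyframes=4, size=(32, 32, 3)):
    """Stand-in for the first-stage latent diffusion keyframe generator."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((num_keyframes,) + size)

def interpolate(frame_a, frame_b, steps=3):
    """Stand-in for the learned interpolation stage: linear blends here."""
    ws = np.linspace(0.0, 1.0, steps + 2)[1:-1]
    return [(1 - w) * frame_a + w * frame_b for w in ws]

def text_to_video(prompt):
    keys = generate_keyframes(prompt)
    video = [keys[0]]
    for a, b in zip(keys[:-1], keys[1:]):
        video.extend(interpolate(a, b))
        video.append(b)
    return np.stack(video)

clip = text_to_video("a boat drifting at sunset")  # (T, H, W, C)
```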
arXiv Detail & Related papers (2023-11-22T00:26:15Z)
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [89.49585127724941]
CoDeF is a new type of video representation, which consists of a canonical content field and a temporal deformation field.
We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training.
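The canonical-content / deformation-field split can be illustrated with a short sketch: every pixel (x, y, t) is mapped into one shared canonical image and sampled there. The translation-only deformation and nearest-neighbor sampling below are stand-ins for CoDeF's learned implicit fields:
```python
import numpy as np

def deformation_field(xs, ys, t):
    """Stand-in learned field: shift pixels right over time."""
    return xs - 2 * t, ys  # canonical coordinates (u, v)

def render_frame(canonical, t):
    h, w = canonical.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    u, v = deformation_field(xs, ys, t)
    u = np.clip(u, 0, w - 1).astype(int)
    v = np.clip(v, 0, h - 1).astype(int)
    return canonical[v, u]

canonical = np.random.rand(32, 32, 3)  # the canonical content image
video = np.stack([render_frame(canonical, t) for t in range(8)])
```
Because every frame samples one canonical image, an edit painted onto it once propagates to the whole video, which is what enables the training-free lifting of image tasks described above.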
arXiv Detail & Related papers (2023-08-15T17:59:56Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Unsupervised Video Interpolation by Learning Multilayered 2.5D Motion Fields [75.81417944207806]
This paper presents a self-supervised approach to video frame interpolation that requires only a single video.
We parameterize the video motions by solving an ordinary differential equation (ODE) defined on a time-varying motion field.
This implicit neural representation learns the video as a space-time continuum, allowing frame interpolation at any temporal resolution.
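A sketch of the ODE view, assuming Euler integration and an analytic stand-in for the learned time-varying motion field; integrating to an arbitrary t is what permits interpolation at any temporal resolution:
```python
import numpy as np

def motion_field(x, y, t):
    """Stand-in time-varying velocity field (dx/dt, dy/dt) at (x, y, t)."""
    return np.ones_like(x, dtype=float) * (1.0 + t), np.zeros_like(y, dtype=float)

def integrate_motion(x0, y0, t0, t1, steps=16):
    """Euler integration of the motion ODE to displace points from t0 to t1."""
    x, y, dt = x0.astype(float), y0.astype(float), (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        vx, vy = motion_field(x, y, t)
        x, y, t = x + vx * dt, y + vy * dt, t + dt
    return x, y

ys, xs = np.mgrid[0:32, 0:32]
x1, y1 = integrate_motion(xs, ys, t0=0.0, t1=0.5)  # any in-between time works
```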
arXiv Detail & Related papers (2022-04-21T06:17:05Z)
- Layered Neural Atlases for Consistent Video Editing [37.69447642502351]
We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases.
For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases.
We design our atlases to be interpretable and semantic, which facilitates easy and intuitive editing in the atlas domain.
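A toy illustration of the atlas mechanism: each video pixel stores a 2D coordinate into a shared atlas, so an edit painted once on the atlas reappears consistently in every frame. The identity-plus-shift mapping below is an illustrative stand-in for the learned per-pixel mapping:
```python
import numpy as np

h, w, T = 32, 32, 6
atlas = np.random.rand(64, 64, 3)
atlas[20:30, 20:30] = [1.0, 0.0, 0.0]  # "edit": paint a red patch once

def uv_map(t):
    """Stand-in mapping from frame-t pixels to atlas coordinates."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.clip(xs + t, 0, 63), np.clip(ys + t, 0, 63)

frames = []
for t in range(T):
    u, v = uv_map(t)
    frames.append(atlas[v, u])  # the edit tracks the motion across frames
video = np.stack(frames)
```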
arXiv Detail & Related papers (2021-09-23T14:58:59Z)
- Street-view Panoramic Video Synthesis from a Single Satellite Image [92.26826861266784]
We present a novel method for synthesizing both temporally and geometrically consistent street-view panoramic video.
Existing cross-view synthesis approaches focus on images, while video synthesis in this setting has received little attention.
arXiv Detail & Related papers (2020-12-11T20:22:38Z)
- Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics.
The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects.
Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
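The two-part motion model can be sketched as follows: a dense backward flow warps the background, and each moving object is advanced by its own affine transform. The flow, mask, and transform below are toy stand-ins, not the paper's predicted quantities:
```python
import numpy as np

h, w = 32, 32
frame = np.random.rand(h, w, 3)
obj_mask = np.zeros((h, w), bool)
obj_mask[10:20, 10:20] = True

# Background: backward warp with a dense flow field (here a uniform drift).
flow = np.full((h, w, 2), 0.5)  # (dy, dx) per pixel
ys, xs = np.mgrid[0:h, 0:w]
src_y = np.clip((ys - flow[..., 0]).astype(int), 0, h - 1)
src_x = np.clip((xs - flow[..., 1]).astype(int), 0, w - 1)
background = frame[src_y, src_x]

# Object: affine transform (a translation here) applied to the masked region.
A, t_vec = np.eye(2), np.array([0, 2])  # shift the object 2 pixels right
next_frame = background.copy()
for y, x in zip(*np.nonzero(obj_mask)):
    ny, nx = (A @ np.array([y, x]) + t_vec).astype(int)
    if 0 <= ny < h and 0 <= nx < w:
        next_frame[ny, nx] = frame[y, x]
```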
arXiv Detail & Related papers (2020-04-01T16:09:54Z)