Deformable Sprites for Unsupervised Video Decomposition
- URL: http://arxiv.org/abs/2204.07151v1
- Date: Thu, 14 Apr 2022 17:58:02 GMT
- Title: Deformable Sprites for Unsupervised Video Decomposition
- Authors: Vickie Ye, Zhengqi Li, Richard Tucker, Angjoo Kanazawa, Noah Snavely
- Abstract summary: We represent each scene element as a Deformable Sprite consisting of three components.
The resulting decomposition allows for applications such as consistent video editing.
- Score: 66.73136214980309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a method to extract persistent elements of a dynamic scene from
an input video. We represent each scene element as a Deformable Sprite
consisting of three components: 1) a 2D texture image for the entire video, 2)
per-frame masks for the element, and 3) non-rigid deformations that map the
texture image into each video frame. The resulting decomposition allows for
applications such as consistent video editing. Deformable Sprites are a type of
video auto-encoder model that is optimized on individual videos, and does not
require training on a large dataset, nor does it rely on pre-trained models.
Moreover, our method does not require object masks or other user input, and
discovers moving objects of a wider variety than previous work. We evaluate our
approach on standard video datasets and show qualitative results on a diverse
array of Internet videos. Code and video results can be found at
https://deformable-sprites.github.io
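To make the representation concrete, here is a minimal sketch, in PyTorch-style Python, of the three components named in the abstract and of how a sprite could be warped into a frame and composited. All class and function names, tensor shapes, and the back-to-front compositing order are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


class DeformableSprite:
    """One scene element: a shared texture, per-frame masks, per-frame warps (assumed layout)."""

    def __init__(self, texture, masks, uv_maps):
        self.texture = texture    # 1) one texture image for the whole video: (3, Ht, Wt)
        self.masks = masks        # 2) per-frame soft masks for this element: (T, 1, H, W)
        self.uv_maps = uv_maps    # 3) per-frame non-rigid deformations, stored here as sampling
                                  #    coordinates into the texture in [-1, 1]: (T, H, W, 2)

    def render(self, t):
        """Warp the shared texture into frame t and return it with that frame's mask."""
        grid = self.uv_maps[t].unsqueeze(0)                    # (1, H, W, 2)
        warped = F.grid_sample(self.texture.unsqueeze(0), grid,
                               align_corners=True)             # (1, 3, H, W)
        return warped[0], self.masks[t]                        # color, alpha


def composite(sprites, t):
    """Alpha-composite all sprites for frame t (assumed back-to-front order)."""
    frame = torch.zeros_like(sprites[0].render(t)[0])
    for sprite in sprites:
        color, alpha = sprite.render(t)
        frame = alpha * color + (1.0 - alpha) * frame
    return frame
```

Editing a video then amounts to painting on a sprite's single texture image; the per-frame deformations carry the edit into every frame, which is what makes the resulting edits consistent over time.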
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
- Drag-A-Video: Non-rigid Video Editing with Point-based Interaction [63.78538355189017]
We propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video.
Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video.
To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video.
arXiv Detail & Related papers (2023-12-05T18:05:59Z)
- Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time [14.015909536844337]
We present a video decomposition method that facilitates layer-based editing of videos with temporally varying lighting effects.
Our method efficiently learns layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing.
We propose to adopt evaluation metrics for objectively assessing the consistency of video editing.
arXiv Detail & Related papers (2023-09-25T10:36:14Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
- Layered Neural Atlases for Consistent Video Editing [37.69447642502351]
We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases.
For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases (see the sketch after this list).
We design our atlases to be interpretable and semantic, which facilitates easy and intuitive editing in the atlas domain.
arXiv Detail & Related papers (2021-09-23T14:58:59Z)
- Self-Supervised Equivariant Scene Synthesis from Video [84.15595573718925]
We propose a framework to learn scene representations from video that are automatically delineated into background, characters, and animations.
After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components.
We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
arXiv Detail & Related papers (2021-02-01T14:17:31Z)
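The Layered Neural Atlases entry above describes estimating, for every video pixel, a 2D coordinate in each atlas. A minimal sketch of that idea follows, assuming small coordinate MLPs and a single layer; the widths, depths, activations, and normalization are assumptions, not that paper's exact configuration.

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256, depth=4):
    """A small fully connected network (sizes are assumed, not from the paper)."""
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


mapping = mlp(3, 2)   # (x, y, t) video coordinate -> (u, v) atlas coordinate
atlas = mlp(2, 3)     # (u, v) atlas coordinate    -> RGB color stored in the atlas


def reconstruct(pixels_xyt):
    """pixels_xyt: (N, 3) video coordinates, assumed normalized to [-1, 1]."""
    uv = torch.tanh(mapping(pixels_xyt))    # keep atlas coordinates bounded
    rgb = torch.sigmoid(atlas(uv))          # reconstructed pixel colors in [0, 1]
    return rgb, uv
```

Because every frame's pixels map into the same atlas, an edit made once in the atlas domain propagates to the whole video, which is the property both this paper and Deformable Sprites exploit for consistent editing.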
This list is automatically generated from the titles and abstracts of the papers on this site.