Deformable Sprites for Unsupervised Video Decomposition
- URL: http://arxiv.org/abs/2204.07151v1
- Date: Thu, 14 Apr 2022 17:58:02 GMT
- Title: Deformable Sprites for Unsupervised Video Decomposition
- Authors: Vickie Ye, Zhengqi Li, Richard Tucker, Angjoo Kanazawa, Noah Snavely
- Abstract summary: We represent each scene element as a Deformable Sprite consisting of three components.
The resulting decomposition allows for applications such as consistent video editing.
- Score: 66.73136214980309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a method to extract persistent elements of a dynamic scene from
an input video. We represent each scene element as a \emph{Deformable Sprite}
consisting of three components: 1) a 2D texture image for the entire video, 2)
per-frame masks for the element, and 3) non-rigid deformations that map the
texture image into each video frame. The resulting decomposition allows for
applications such as consistent video editing. Deformable Sprites are a type of
video auto-encoder model that is optimized on individual videos, and does not
require training on a large dataset, nor does it rely on pre-trained models.
Moreover, our method does not require object masks or other user input, and
discovers moving objects of a wider variety than previous work. We evaluate our
approach on standard video datasets and show qualitative results on a diverse
array of Internet videos. Code and video results can be found at
https://deformable-sprites.github.io
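The three components above map directly onto a small data structure. Below is a minimal PyTorch sketch of how one frame could be reconstructed from them; the tensor shapes, names, and compositing order are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def composite_frame(textures, warps, masks):
    """Reconstruct one video frame from Deformable Sprite components.

    A minimal sketch of the representation the abstract describes.
    Each scene element contributes:
      textures[k]: (3, Ht, Wt)  canonical texture image for the video
      warps[k]:    (H, W, 2)    per-frame sampling grid in [-1, 1] that
                                maps frame pixels to texture coordinates
      masks[k]:    (H, W)       per-frame soft mask for the element
    Layers are assumed ordered back to front.
    """
    frame = torch.zeros(3, *masks[0].shape)
    for tex, warp, mask in zip(textures, warps, masks):
        # Non-rigid deformation: sample the texture at warped coordinates.
        layer = F.grid_sample(tex[None], warp[None], align_corners=True)[0]
        # Alpha-composite this element over what is behind it.
        frame = mask[None] * layer + (1 - mask[None]) * frame
    return frame
```

Because every frame samples the same texture image, an edit made once in the texture propagates consistently to all frames, which is what enables the consistent video editing application mentioned above.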
Related papers
- Lester: rotoscope animation through video object segmentation and tracking [0.0]
Lester is a novel method to automatically synthesize retro-style 2D animations from videos.
Video frames are processed with the Segment Anything Model (SAM) and the resulting masks are tracked through subsequent frames with DeAOT.
Results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances.
arXiv Detail & Related papers (2024-02-15T11:15:54Z)
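As a rough illustration of the pipeline this summary describes, the sketch below segments the first frame with the segment-anything library's real API and hands the resulting masks to a tracker. `load_frames` and `deaot_track` are hypothetical placeholders; DeAOT's actual propagation interface is not reproduced here.

```python
# Hedged sketch of the Lester pipeline: SAM masks on the first frame,
# then mask propagation through the rest of the video.
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

frames = load_frames("input.mp4")       # hypothetical helper: list of (H, W, 3) uint8 RGB frames
masks = [m["segmentation"] for m in mask_generator.generate(frames[0])]

# Track each first-frame mask through subsequent frames (DeAOT in the
# paper; deaot_track is a hypothetical stand-in for its propagation step).
tracks = deaot_track(frames, masks)
```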
- Drag-A-Video: Non-rigid Video Editing with Point-based Interaction [63.78538355189017]
We propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video.
Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video.
To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video.
arXiv Detail & Related papers (2023-12-05T18:05:59Z)
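The motion supervision idea can be pictured as a simple feature-matching objective in the spirit of drag-based editing; the feature map, point lists, and loss form below are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def drag_motion_loss(feat, handles, targets):
    """Pull the feature at each handle point toward the feature at its
    paired target point. feat: (C, H, W) feature map from the diffusion
    model; handles/targets: lists of (x, y) pixel coordinates.
    Illustrative sketch only, not Drag-A-Video's exact objective."""
    loss = 0.0
    for (hx, hy), (tx, ty) in zip(handles, targets):
        # detach() the target feature so gradients move the handle only.
        loss = loss + F.l1_loss(feat[:, hy, hx], feat[:, ty, tx].detach())
    return loss / max(len(handles), 1)
```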
- Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time [14.015909536844337]
We present a video decomposition method that facilitates layer-based editing of videos with temporally varying lighting effects.
Our method efficiently learns layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing.
We propose to adopt evaluation metrics for objectively assessing the consistency of video editing.
arXiv Detail & Related papers (2023-09-25T10:36:14Z)
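"Coordinate hashing" here refers to hash-grid feature lookups. A single-level, nearest-vertex sketch is shown below; the table size, hash primes, and feature width are illustrative, and the real method uses multiple resolutions with interpolated corner features rather than this simplification.

```python
import torch
import torch.nn as nn

class HashEncoding(nn.Module):
    """Minimal single-level coordinate hash encoding, in the spirit of
    the hash grids the summary alludes to. A sketch, not the paper's
    implementation."""

    PRIMES = (1, 2654435761, 805459861)  # per-dimension hash primes

    def __init__(self, table_size=2**16, features=2, resolution=128):
        super().__init__()
        self.table = nn.Parameter(torch.randn(table_size, features) * 1e-4)
        self.resolution = resolution

    def forward(self, xyt):  # (N, 3) space-time coordinates in [0, 1]
        # Snap each coordinate to its nearest grid vertex (the real
        # method interpolates the surrounding corners instead).
        idx = (xyt * self.resolution).long()
        h = torch.zeros_like(idx[:, 0])
        for d, p in enumerate(self.PRIMES):
            h = h ^ (idx[:, d] * p)
        return self.table[h % self.table.size(0)]  # (N, features)
```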
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
- Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z)
- Layered Neural Atlases for Consistent Video Editing [37.69447642502351]
We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases.
For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases.
We design our atlases to be interpretable and semantic, which facilitates easy and intuitive editing in the atlas domain.
arXiv Detail & Related papers (2021-09-23T14:58:59Z)
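The per-pixel mapping this summary mentions can be pictured as a small coordinate MLP from video coordinates to atlas coordinates; the architecture below is illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class AtlasMapping(nn.Module):
    """Toy version of the per-pixel mapping the summary describes:
    an MLP that sends a video coordinate (x, y, t) to a 2D atlas
    coordinate (u, v). Layer sizes are illustrative assumptions."""

    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),  # (u, v) in [-1, 1]
        )

    def forward(self, xyt):  # (N, 3) video coordinates
        return self.mlp(xyt)  # (N, 2) atlas coordinates
```

Editing then amounts to painting in the atlas image and sampling it back through this mapping for every frame, which is why edits made in the atlas domain stay temporally consistent.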
- Self-Supervised Equivariant Scene Synthesis from Video [84.15595573718925]
We propose a framework to learn scene representations from video that are automatically delineated into background, characters, and animations.
After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components.
We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
arXiv Detail & Related papers (2021-02-01T14:17:31Z)