Deformable Sprites for Unsupervised Video Decomposition
- URL: http://arxiv.org/abs/2204.07151v1
- Date: Thu, 14 Apr 2022 17:58:02 GMT
- Title: Deformable Sprites for Unsupervised Video Decomposition
- Authors: Vickie Ye, Zhengqi Li, Richard Tucker, Angjoo Kanazawa, Noah Snavely
- Abstract summary: We represent each scene element as a Deformable Sprite consisting of three components.
The resulting decomposition allows for applications such as consistent video editing.
- Score: 66.73136214980309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a method to extract persistent elements of a dynamic scene from
an input video. We represent each scene element as a \emph{Deformable Sprite}
consisting of three components: 1) a 2D texture image for the entire video, 2)
per-frame masks for the element, and 3) non-rigid deformations that map the
texture image into each video frame. The resulting decomposition allows for
applications such as consistent video editing. Deformable Sprites are a type of
video auto-encoder model that is optimized on individual videos, and does not
require training on a large dataset, nor does it rely on pre-trained models.
Moreover, our method does not require object masks or other user input, and
discovers moving objects of a wider variety than previous work. We evaluate our
approach on standard video datasets and show qualitative results on a diverse
array of Internet videos. Code and video results can be found at
https://deformable-sprites.github.io
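The three components above map directly onto a small data structure. Below is a minimal PyTorch sketch of how one frame could be reconstructed from them; the tensor shapes, names, and compositing order are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def composite_frame(textures, warps, masks):
    """Reconstruct one video frame from Deformable Sprite components.

    A minimal sketch of the representation the abstract describes.
    Each scene element contributes:
      textures[k]: (3, Ht, Wt)  canonical texture image for the video
      warps[k]:    (H, W, 2)    per-frame sampling grid in [-1, 1] that
                                maps frame pixels to texture coordinates
      masks[k]:    (H, W)       per-frame soft mask for the element
    Layers are assumed ordered back to front.
    """
    frame = torch.zeros(3, *masks[0].shape)
    for tex, warp, mask in zip(textures, warps, masks):
        # Non-rigid deformation: sample the texture at warped coordinates.
        layer = F.grid_sample(tex[None], warp[None], align_corners=True)[0]
        # Alpha-composite this element over what is behind it.
        frame = mask[None] * layer + (1 - mask[None]) * frame
    return frame
```

Because every frame samples the same texture image, an edit made once in the texture propagates consistently to all frames, which is what enables the consistent video editing application mentioned above.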
Related papers
- Lester: rotoscope animation through video object segmentation and tracking [0.0]
Lester is a novel method to automatically synthesize retro-style 2D animations from videos.
Video frames are processed with the Segment Anything Model (SAM) and the resulting masks are tracked through subsequent frames with DeAOT.
Results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances.
arXiv Detail & Related papers (2024-02-15T11:15:54Z)
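As a rough illustration of the pipeline this summary describes, the sketch below segments the first frame with the segment-anything library's real API and hands the resulting masks to a tracker. `load_frames` and `deaot_track` are hypothetical placeholders; DeAOT's actual propagation interface is not reproduced here.

```python
# Hedged sketch of the Lester pipeline: SAM masks on the first frame,
# then mask propagation through the rest of the video.
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

frames = load_frames("input.mp4")       # hypothetical helper: list of (H, W, 3) uint8 RGB frames
masks = [m["segmentation"] for m in mask_generator.generate(frames[0])]

# Track each first-frame mask through subsequent frames (DeAOT in the
# paper; deaot_track is a hypothetical stand-in for its propagation step).
tracks = deaot_track(frames, masks)
```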
- Drag-A-Video: Non-rigid Video Editing with Point-based Interaction [63.78538355189017]
We propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video.
Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video.
To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video.
arXiv Detail & Related papers (2023-12-05T18:05:59Z)
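The motion supervision idea can be pictured as a simple feature-matching objective in the spirit of drag-based editing; the feature map, point lists, and loss form below are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def drag_motion_loss(feat, handles, targets):
    """Pull the feature at each handle point toward the feature at its
    paired target point. feat: (C, H, W) feature map from the diffusion
    model; handles/targets: lists of (x, y) pixel coordinates.
    Illustrative sketch only, not Drag-A-Video's exact objective."""
    loss = 0.0
    for (hx, hy), (tx, ty) in zip(handles, targets):
        # detach() the target feature so gradients move the handle only.
        loss = loss + F.l1_loss(feat[:, hy, hx], feat[:, ty, tx].detach())
    return loss / max(len(handles), 1)
```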
- Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time [14.015909536844337]
We present a video decomposition method that facilitates layer-based editing of videos with temporally varying lighting effects.
Our method efficiently learns layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing.
We propose to adopt evaluation metrics for objectively assessing the consistency of video editing.
arXiv Detail & Related papers (2023-09-25T10:36:14Z)
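"Coordinate hashing" here refers to hash-grid feature lookups. A single-level, nearest-vertex sketch is shown below; the table size, hash primes, and feature width are illustrative, and the real method uses multiple resolutions with interpolated corner features rather than this simplification.

```python
import torch
import torch.nn as nn

class HashEncoding(nn.Module):
    """Minimal single-level coordinate hash encoding, in the spirit of
    the hash grids the summary alludes to. A sketch, not the paper's
    implementation."""

    PRIMES = (1, 2654435761, 805459861)  # per-dimension hash primes

    def __init__(self, table_size=2**16, features=2, resolution=128):
        super().__init__()
        self.table = nn.Parameter(torch.randn(table_size, features) * 1e-4)
        self.resolution = resolution

    def forward(self, xyt):  # (N, 3) space-time coordinates in [0, 1]
        # Snap each coordinate to its nearest grid vertex (the real
        # method interpolates the surrounding corners instead).
        idx = (xyt * self.resolution).long()
        h = torch.zeros_like(idx[:, 0])
        for d, p in enumerate(self.PRIMES):
            h = h ^ (idx[:, d] * p)
        return self.table[h % self.table.size(0)]  # (N, features)
```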
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
- Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z)
- Layered Neural Atlases for Consistent Video Editing [37.69447642502351]
We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases.
For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases.
We design our atlases to be interpretable and semantic, which facilitates easy and intuitive editing in the atlas domain.
arXiv Detail & Related papers (2021-09-23T14:58:59Z)
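The per-pixel mapping this summary mentions can be pictured as a small coordinate MLP from video coordinates to atlas coordinates; the architecture below is illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class AtlasMapping(nn.Module):
    """Toy version of the per-pixel mapping the summary describes:
    an MLP that sends a video coordinate (x, y, t) to a 2D atlas
    coordinate (u, v). Layer sizes are illustrative assumptions."""

    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),  # (u, v) in [-1, 1]
        )

    def forward(self, xyt):  # (N, 3) video coordinates
        return self.mlp(xyt)  # (N, 2) atlas coordinates
```

Editing then amounts to painting in the atlas image and sampling it back through this mapping for every frame, which is why edits made in the atlas domain stay temporally consistent.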
- Self-Supervised Equivariant Scene Synthesis from Video [84.15595573718925]
We propose a framework to learn scene representations from video that are automatically delineated into background, characters, and animations.
After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components.
We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
arXiv Detail & Related papers (2021-02-01T14:17:31Z)