MAGVIT: Masked Generative Video Transformer
- URL: http://arxiv.org/abs/2212.05199v2
- Date: Wed, 5 Apr 2023 02:32:59 GMT
- Title: MAGVIT: Masked Generative Video Transformer
- Authors: Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
- Abstract summary: We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains.
- Score: 129.50814875955444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle
various video synthesis tasks with a single model. We introduce a 3D tokenizer
to quantize a video into spatial-temporal visual tokens and propose an
embedding method for masked video token modeling to facilitate multi-task
learning. We conduct extensive experiments to demonstrate the quality,
efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT
performs favorably against state-of-the-art approaches and establishes the
best-published FVD on three video generation benchmarks, including the
challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference
time by two orders of magnitude against diffusion models and by 60x against
autoregressive models. (iii) A single MAGVIT model supports ten diverse
generation tasks and generalizes across videos from different visual domains.
The source code and trained models will be released to the public at
https://magvit.cs.cmu.edu.
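The abstract describes two pieces: a 3D tokenizer that quantizes a clip into a grid of spatio-temporal token ids, and masked token modeling over that grid. The sketch below illustrates only the second piece as a generic PyTorch-style training step; it is not the released MAGVIT code, and the class names, mask ratio, and shapes (MaskedVideoTransformer, mask_ratio=0.6, a flattened (B, N) token grid) are illustrative assumptions.

```python
# Minimal sketch of masked video token modeling (not the official MAGVIT code).
# Assumes a separate 3D tokenizer has already mapped a video clip to a grid of
# discrete token ids, flattened to shape (B, N).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedVideoTransformer(nn.Module):
    def __init__(self, vocab_size=1024, dim=512, depth=8, heads=8, max_tokens=1024):
        super().__init__()
        self.mask_id = vocab_size            # extra id reserved for [MASK]
        self.tok_emb = nn.Embedding(vocab_size + 1, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        x = self.tok_emb(token_ids) + self.pos_emb[:, : token_ids.shape[1]]
        return self.head(self.encoder(x))    # (B, N, vocab_size) logits

def masked_training_step(model, token_ids, mask_ratio=0.6):
    """One training step: mask a subset of video tokens, predict them."""
    b, n = token_ids.shape
    mask = torch.rand(b, n, device=token_ids.device) < mask_ratio
    corrupted = token_ids.masked_fill(mask, model.mask_id)
    logits = model(corrupted)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], token_ids[mask])
```

At inference, masked generative models of this kind typically decode iteratively: start from a fully masked grid plus any condition tokens, predict everything, keep the most confident predictions, and re-mask the rest for the next step; only the training loss is sketched here.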
Related papers
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern from 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z) - Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers; a minimal sketch of this token extraction appears after this list.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
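The ViViT entry above mentions extracting spatio-temporal tokens from the input video, and a related idea (a 3D tokenizer over spatio-temporal patches, followed by quantization) appears in the MAGVIT abstract. Below is a minimal, hypothetical tubelet-embedding sketch in the same PyTorch style: a 3D convolution whose stride equals its kernel cuts the clip into non-overlapping tubelets and projects each one to a token embedding. The tubelet size, embedding dimension, and the TubeletEmbed name are assumptions for illustration, not code from either paper.

```python
# Hypothetical sketch of spatio-temporal token extraction (tubelet embedding).
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    def __init__(self, in_channels=3, dim=256, tubelet=(2, 16, 16)):
        super().__init__()
        # kernel == stride: non-overlapping spatio-temporal patches
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (B, C, T, H, W) -> (B, dim, T', H', W')
        x = self.proj(video)
        # Flatten the spatio-temporal grid into a token sequence: (B, N, dim)
        return x.flatten(2).transpose(1, 2)

# Example: a 16-frame 128x128 clip becomes an 8 x 8 x 8 = 512-token sequence.
tokens = TubeletEmbed()(torch.randn(1, 3, 16, 128, 128))
print(tokens.shape)  # torch.Size([1, 512, 256])
```

In a quantized pipeline such as MAGVIT's, each of these continuous embeddings would additionally be snapped to its nearest codebook entry so that the video is represented by discrete token ids; this sketch stops at the continuous token grid.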
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.