VideoGPT: Video Generation using VQ-VAE and Transformers
- URL: http://arxiv.org/abs/2104.10157v1
- Date: Tue, 20 Apr 2021 17:58:03 GMT
- Title: VideoGPT: Video Generation using VQ-VAE and Transformers
- Authors: Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas
- Abstract summary: VideoGPT is a conceptually simple architecture for scaling likelihood based generative modeling to natural videos.
VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations by employing 3D convolutions and axial self-attention.
Our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset.
- Score: 75.20543171520565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present VideoGPT: a conceptually simple architecture for scaling
likelihood based generative modeling to natural videos. VideoGPT uses VQ-VAE
that learns downsampled discrete latent representations of a raw video by
employing 3D convolutions and axial self-attention. A simple GPT-like
architecture is then used to autoregressively model the discrete latents using
spatio-temporal position encodings. Despite the simplicity in formulation and
ease of training, our architecture is able to generate samples competitive with
state-of-the-art GAN models for video generation on the BAIR Robot dataset, and
generate high fidelity natural videos from UCF-101 and Tumblr GIF Dataset
(TGIF). We hope our proposed architecture serves as a reproducible reference
for a minimalistic implementation of transformer based video generation models.
Samples and code are available at
https://wilson1yan.github.io/videogpt/index.html
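The two-stage recipe in the abstract (a VQ-VAE that compresses raw video into a downsampled grid of discrete latents, then a GPT-like prior with spatio-temporal position encodings over those latents) can be made concrete with a minimal sketch. The code below is illustrative only, not the authors' implementation: the module names, layer sizes, and omission of causal masking are simplifications.
```python
# Minimal illustrative sketch (not the authors' code): axial self-attention over a
# (batch, time, height, width, channels) grid of VQ-VAE latent codes, plus additive
# spatio-temporal position embeddings for a GPT-like prior. Causal masking is omitted.
import torch
import torch.nn as nn


class AxialSelfAttention(nn.Module):
    """Multi-head self-attention applied along one axis of a (B, T, H, W, C) tensor."""

    def __init__(self, dim: int, heads: int, axis: int):
        super().__init__()
        self.axis = axis  # 1 = time, 2 = height, 3 = width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.movedim(self.axis, -2)                  # attended axis becomes the sequence axis
        shape = x.shape
        x = x.reshape(-1, shape[-2], shape[-1])       # fold the remaining axes into the batch
        out, _ = self.attn(x, x, x)
        return out.reshape(shape).movedim(-2, self.axis)


class LatentPrior(nn.Module):
    """GPT-style prior over the VQ-VAE code grid (sketch only; no causal masking)."""

    def __init__(self, vocab: int, dim: int, t: int, h: int, w: int, heads: int = 4):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        # Separate learned position embeddings per axis, broadcast over the latent grid.
        self.pos_t = nn.Parameter(torch.zeros(t, 1, 1, dim))
        self.pos_h = nn.Parameter(torch.zeros(1, h, 1, dim))
        self.pos_w = nn.Parameter(torch.zeros(1, 1, w, dim))
        self.blocks = nn.ModuleList([AxialSelfAttention(dim, heads, a) for a in (1, 2, 3)])
        self.head = nn.Linear(dim, vocab)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (B, T, H, W) integer indices produced by the VQ-VAE encoder.
        x = self.tok(codes) + self.pos_t + self.pos_h + self.pos_w
        for block in self.blocks:
            x = x + block(x)                          # residual axial attention
        return self.head(x)                           # per-position logits over the codebook


logits = LatentPrior(vocab=1024, dim=64, t=4, h=8, w=8)(torch.randint(0, 1024, (2, 4, 8, 8)))
```
In the full method, the latent grid comes from a 3D-convolutional VQ-VAE encoder and the prior is trained with a causal mask so that sampling can proceed one latent code at a time.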
Related papers
- JPEG-LM: LLMs as Image Generators with Canonical Codec Representations [51.097213824684665]
Discretization represents continuous data like images and videos as discrete tokens.
Common methods of discretizing images and videos include modeling raw pixel values.
We show that using canonical representations can help lower the barriers between language generation and visual generation.
arXiv Detail & Related papers (2024-08-15T23:57:02Z)
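The idea summarized above, using a canonical codec representation instead of raw pixels or learned VQ tokens, can be illustrated with a small sketch: serialize an image with a standard JPEG codec and treat its bytes as the token sequence a language model consumes. The quality setting and byte-level vocabulary below are assumptions, not the paper's exact configuration.
```python
# Illustrative sketch (assumed details, not the paper's exact pipeline): represent an
# image as its canonical JPEG byte stream and treat the bytes as tokens for an
# autoregressive language model.
import io
from PIL import Image


def image_to_jpeg_tokens(path: str, quality: int = 25) -> list[int]:
    """Encode an image with a standard JPEG codec and return its bytes as token ids (0-255)."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG", quality=quality)
    return list(buf.getvalue())


def jpeg_tokens_to_image(tokens: list[int]) -> Image.Image:
    """Decode a generated byte sequence back into an image with the same codec."""
    return Image.open(io.BytesIO(bytes(tokens)))
```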
- GenDeF: Learning Generative Deformation Field for Video Generation [89.49567113452396]
We propose to render a video by warping one static image with a generative deformation field (GenDeF).
Such a pipeline enjoys three appealing advantages.
arXiv Detail & Related papers (2023-12-07T18:59:41Z)
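GenDeF's central operation, rendering each frame by warping one static image with a generated deformation field, can be sketched with standard bilinear grid sampling. The deformation fields below are stubbed with random offsets; in the actual method they would come from the generative model.
```python
# Illustrative sketch (not GenDeF's code): warp one static image with a per-frame
# deformation field via bilinear grid sampling. The fields are stubbed with small
# random offsets; in GenDeF they would be produced by the generator.
import torch
import torch.nn.functional as F


def warp_with_field(image: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """image: (1, C, H, W); offsets: (T, H, W, 2) in normalized [-1, 1] coordinates."""
    t, h, w, _ = offsets.shape
    # Identity sampling grid in normalized coordinates, one copy per output frame.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(t, h, w, 2)
    frames = F.grid_sample(image.expand(t, -1, -1, -1), base + offsets, align_corners=True)
    return frames  # (T, C, H, W): a video rendered by deforming the single input image


video = warp_with_field(torch.rand(1, 3, 64, 64), 0.05 * torch.randn(16, 64, 64, 2))
```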
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
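The property highlighted above, running diffusion in a low-dimensional latent space rather than in pixel space, can be sketched as a single noise-prediction training step. The encoder and denoiser below are placeholder convolutions standing in for the pretrained VAE and the 3D U-Net, and the noise schedule and shapes are likewise assumptions.
```python
# Illustrative sketch (placeholder modules, not MagicVideo's architecture): one
# noise-prediction training step for video diffusion run in a low-dimensional latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Conv3d(3, 4, kernel_size=4, stride=4)    # stand-in for a frozen VAE encoder
denoiser = nn.Conv3d(4, 4, kernel_size=3, padding=1)  # stand-in for the 3D U-Net denoiser


def diffusion_step(video: torch.Tensor, num_steps: int = 1000) -> torch.Tensor:
    """video: (B, 3, T, H, W) in [0, 1]. Returns the noise-prediction loss for one step."""
    with torch.no_grad():
        latents = encoder(video)                       # diffuse in the smaller latent space
    t = torch.randint(1, num_steps, (video.shape[0],))
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps).view(-1, 1, 1, 1, 1) ** 2
    noise = torch.randn_like(latents)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
    # Timestep conditioning of the denoiser is omitted here for brevity.
    return F.mse_loss(denoiser(noisy), noise)          # epsilon-prediction objective


loss = diffusion_step(torch.rand(2, 3, 8, 32, 32))
```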
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
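The cascade mentioned above, a base text-conditional video model followed by spatial and temporal super-resolution models, can be sketched as a chained sampler. Every stage below is a placeholder (random generation and plain interpolation), meant only to show how outputs flow through the cascade.
```python
# Illustrative sketch of a cascaded sampler (placeholder stages, not Imagen Video's
# models): a base text-conditional generator followed by temporal/spatial upsamplers.
import torch
import torch.nn.functional as F


def base_stage(prompt: str) -> torch.Tensor:
    return torch.rand(1, 3, 8, 24, 24)        # stand-in: low-res, low-frame-rate sample


def spatial_sr(video: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # Stand-in for a spatial super-resolution diffusion model.
    b, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    frames = F.interpolate(frames, scale_factor=scale, mode="bilinear", align_corners=False)
    return frames.reshape(b, t, c, h * scale, w * scale).permute(0, 2, 1, 3, 4)


def temporal_sr(video: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # Stand-in for a temporal super-resolution (frame interpolation) diffusion model.
    return F.interpolate(video, scale_factor=(scale, 1, 1), mode="trilinear", align_corners=False)


def cascade(prompt: str) -> torch.Tensor:
    video = base_stage(prompt)
    for stage in (temporal_sr, spatial_sr, spatial_sr):   # chain of refinement stages
        video = stage(video)
    return video                                          # (1, 3, 16, 96, 96)


sample = cascade("a hot air balloon drifting over a lake")
```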
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
- A strong baseline for image and video quality assessment [4.73466728067544]
We present a simple yet effective unified model for perceptual quality assessment of images and videos.
Our model achieves comparable performance by applying only one global feature derived from a backbone network.
Based on the proposed architecture, we release models trained for three common real-world scenarios.
arXiv Detail & Related papers (2021-11-13T12:24:08Z)
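The recipe described above, one global feature pooled from a backbone network followed by a lightweight regressor, can be sketched in a few lines. The ResNet-18 backbone and linear head below are assumptions, not the released models.
```python
# Illustrative sketch (backbone and head are assumptions, not the paper's released model):
# predict a perceptual quality score from one globally pooled backbone feature.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()                    # keep only the global average-pooled feature
quality_head = nn.Linear(512, 1)               # regress a single quality score


def predict_quality(frames: torch.Tensor) -> torch.Tensor:
    """frames: (N, 3, H, W) images or sampled video frames; returns one score per input."""
    feats = backbone(frames)                   # (N, 512) global features
    return quality_head(feats).squeeze(-1)     # (N,) predicted quality scores


scores = predict_quality(torch.rand(4, 3, 224, 224))
```
For video, per-frame scores from the same model can simply be averaged over sampled frames.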
- Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn the complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z)
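The object-centric factorization described above, turning each frame into a small set of object tokens and modeling the token sequence with a transformer, can be sketched as follows. The slot extractor and causal transformer below are placeholders, not OCVT's actual encoder or generative model.
```python
# Illustrative sketch (placeholder modules, not OCVT's model): decompose each frame
# into a few object-centric tokens, then model the token sequence with a causal transformer.
import torch
import torch.nn as nn


class ObjectTokenVideoModel(nn.Module):
    def __init__(self, num_slots: int = 4, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=8, stride=8)      # per-frame patch features
        self.slot_queries = nn.Parameter(torch.randn(num_slots, dim))  # stand-in object slots
        self.slot_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W) -> object tokens per frame -> one autoregressive sequence.
        b, t, c, h, w = video.shape
        feats = self.encoder(video.reshape(b * t, c, h, w)).flatten(2).transpose(1, 2)
        queries = self.slot_queries.expand(b * t, -1, -1)
        slots, _ = self.slot_attn(queries, feats, feats)                # (B*T, slots, dim)
        tokens = slots.reshape(b, -1, slots.shape[-1])                  # (B, T*slots, dim)
        causal = torch.triu(torch.full((tokens.shape[1],) * 2, float("-inf")), diagonal=1)
        return self.temporal(tokens, mask=causal)                       # causally mixed features


features = ObjectTokenVideoModel()(torch.rand(2, 6, 3, 64, 64))
```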
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.