Make-A-Video: Text-to-Video Generation without Text-Video Data
- URL: http://arxiv.org/abs/2209.14792v1
- Date: Thu, 29 Sep 2022 13:59:46 GMT
- Title: Make-A-Video: Text-to-Video Generation without Text-Video Data
- Authors: Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang
Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal
Gupta, Yaniv Taigman
- Abstract summary: Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
- Score: 69.20996352229422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Make-A-Video -- an approach for directly translating the
tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video
(T2V). Our intuition is simple: learn what the world looks like and how it is
described from paired text-image data, and learn how the world moves from
unsupervised video footage. Make-A-Video has three advantages: (1) it
accelerates training of the T2V model (it does not need to learn visual and
multimodal representations from scratch), (2) it does not require paired
text-video data, and (3) the generated videos inherit the vastness (diversity
in aesthetic, fantastical depictions, etc.) of today's image generation models.
We design a simple yet effective way to build on T2I models with novel and
effective spatial-temporal modules. First, we decompose the full temporal U-Net
and attention tensors and approximate them in space and time. Second, we design
a spatial temporal pipeline to generate high resolution and frame rate videos
with a video decoder, interpolation model and two super resolution models that
can enable various applications besides T2V. In all aspects, spatial and
temporal resolution, faithfulness to text, and quality, Make-A-Video sets the
new state-of-the-art in text-to-video generation, as determined by both
qualitative and quantitative measures.
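The space-time decomposition the abstract describes can be illustrated with a simple parameter-count comparison between a full 3D convolution and a factorized 2D-spatial + 1D-temporal pair (a common "pseudo-3D" approximation). The channel counts and kernel sizes below are illustrative assumptions, not the paper's actual layer shapes:

```python
# Illustrative parameter counts: a dense 3D convolution versus a
# factorized (pseudo-3D) spatial-then-temporal pair.
# All sizes here are hypothetical, chosen only to show the scaling.

def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a dense 3D convolution (bias omitted)."""
    return c_in * c_out * kt * kh * kw

def pseudo3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a 2D spatial conv followed by a 1D temporal conv."""
    spatial = c_in * c_out * kh * kw   # applied independently per frame
    temporal = c_out * c_out * kt      # applied per spatial location
    return spatial + temporal

full = conv3d_params(64, 64, 3, 3, 3)       # 64*64*27 = 110592
factored = pseudo3d_params(64, 64, 3, 3, 3) # 64*64*9 + 64*64*3 = 49152
print(full, factored, full / factored)      # factorization uses 2.25x fewer weights
```

Beyond the weight savings, this factorization lets the spatial part be initialized from a pretrained T2I model while only the temporal part is trained on video, which is one way to read the paper's "does not need to learn visual representations from scratch" claim.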
Related papers
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
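The token-count reduction from compressing video both spatially and temporally can be sketched with back-of-the-envelope arithmetic (the stride and patch factors below are hypothetical placeholders, not xGen-VideoSyn-1's actual ratios):

```python
# Back-of-the-envelope token counts for a video VAE that compresses
# along both the temporal and spatial axes before patchification.
# All compression factors here are hypothetical, for illustration only.

def token_count(frames, height, width, t_stride=4, s_stride=8, patch=2):
    """Tokens after temporal striding, spatial striding, and patchification."""
    t = frames // t_stride
    h = height // (s_stride * patch)
    w = width // (s_stride * patch)
    return t * h * w

raw = 32 * 512 * 512                   # naive per-pixel count for a 32-frame clip
compressed = token_count(32, 512, 512) # 8 * 32 * 32 = 8192 tokens
print(raw, compressed, raw // compressed)
```

Shrinking the token sequence this way is what makes full spatial-temporal self-attention in the downstream DiT affordable, since attention cost grows quadratically in sequence length.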
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data [14.489919164476982]
High-quality (HQ) video synthesis is challenging because of the diverse and complex motions that exist in the real world.
Most existing works address this problem by collecting large-scale captioned HQ videos, which are inaccessible to the community.
We show that publicly available limited and low-quality (LQ) data are sufficient to train a HQ video generator without recaptioning or finetuning.
arXiv Detail & Related papers (2024-08-19T16:08:00Z) - CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768×1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets for T2V training.
We propose Tune-A-Video, which is capable of producing temporally-coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.