Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators
- URL: http://arxiv.org/abs/2303.13439v1
- Date: Thu, 23 Mar 2023 17:01:59 GMT
- Title: Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators
- Authors: Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto
Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
- Abstract summary: Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably to, or sometimes better than, recent approaches, despite not being trained on additional video data.
- Score: 70.17041424896507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-video generation approaches rely on computationally heavy
training and require large-scale video datasets. In this paper, we introduce a
new task of zero-shot text-to-video generation and propose a low-cost approach
(without any training or optimization) by leveraging the power of existing
text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable
for the video domain.
Our key modifications include (i) enriching the latent codes of the generated
frames with motion dynamics to keep the global scene and the background time
consistent; and (ii) reprogramming frame-level self-attention using a new
cross-frame attention of each frame on the first frame, to preserve the
context, appearance, and identity of the foreground object.
Experiments show that this leads to low overhead, yet high-quality and
remarkably consistent video generation. Moreover, our approach is not limited
to text-to-video synthesis but is also applicable to other tasks such as
conditional and content-specialized video generation, and Video
Instruct-Pix2Pix, i.e., instruction-guided video editing.
As experiments show, our method performs comparably to, or sometimes better than,
recent approaches, despite not being trained on additional video data. Our code
will be open-sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .
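As a rough, self-contained illustration of the two modifications described in the abstract, the PyTorch sketch below (i) translates the first frame's latent code by a growing offset to inject global motion dynamics and (ii) replaces frame-level self-attention with cross-frame attention whose keys and values come from the first frame. This is not the authors' implementation; the names add_motion_dynamics, CrossFrameAttention, delta_x, and delta_y are illustrative assumptions.

```python
# Minimal sketch of the two zero-shot modifications (not the authors' code).
import torch
import torch.nn.functional as F


def add_motion_dynamics(x1, num_frames, delta_x=1.0, delta_y=1.0):
    """Enrich latent codes with global motion: start from the first frame's
    latent x1 (shape [C, H, W]) and warp it by a growing translation so the
    global scene and background stay temporally consistent across frames."""
    latents = [x1]
    c, h, w = x1.shape
    for k in range(1, num_frames):
        # Affine grid that shifts the sampling grid by k * (delta_x, delta_y)
        # latent pixels (normalized to the [-1, 1] grid used by grid_sample).
        theta = torch.tensor([[1.0, 0.0, -2.0 * k * delta_x / w],
                              [0.0, 1.0, -2.0 * k * delta_y / h]]).unsqueeze(0)
        grid = F.affine_grid(theta, size=(1, c, h, w), align_corners=False)
        warped = F.grid_sample(x1.unsqueeze(0), grid,
                               padding_mode="reflection", align_corners=False)
        latents.append(warped.squeeze(0))
    return torch.stack(latents)  # [num_frames, C, H, W]


class CrossFrameAttention(torch.nn.Module):
    """Reprogrammed self-attention: queries come from each frame, while keys
    and values are taken from the FIRST frame only, preserving the context,
    appearance, and identity of the foreground object."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: [num_frames, seq_len, dim]
        first = frame_tokens[:1].expand_as(frame_tokens)  # broadcast frame 1
        out, _ = self.attn(query=frame_tokens, key=first, value=first)
        return out
```

Note that in the paper the motion warp is applied to a partially denoised latent of the first frame rather than directly to the initial noise; the sketch simplifies this step for brevity.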
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models [0.0]
We propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models.
Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames.
arXiv Detail & Related papers (2023-05-10T02:33:25Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
- TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
The step-by-step learning process helps stabilize training and enables the creation of high-resolution videos conditioned on text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)