CogVideo: Large-scale Pretraining for Text-to-Video Generation via
Transformers
- URL: http://arxiv.org/abs/2205.15868v1
- Date: Sun, 29 May 2022 19:02:15 GMT
- Title: CogVideo: Large-scale Pretraining for Text-to-Video Generation via
Transformers
- Authors: Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
- Abstract summary: Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation.
CogVideo is trained by inheriting a pretrained text-to-image model, CogView2.
CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
- Score: 16.255516347736535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pretrained transformers have created milestones in
text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their
application to video generation still faces many challenges: the potentially
huge computation cost makes training from scratch unaffordable, and the
scarcity and weak relevance of text-video datasets hinder the model from
understanding complex movement semantics. In this work, we present CogVideo,
a 9B-parameter transformer trained by inheriting a pretrained text-to-image
model, CogView2. We also propose a multi-frame-rate hierarchical training
strategy to better align text and video clips. As (probably) the first
open-source large-scale pretrained text-to-video model, CogVideo outperforms
all publicly available models by a large margin in machine and human
evaluations.
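The multi-frame-rate idea can be pictured with a short sketch. The snippet below is a minimal, hypothetical illustration of the strategy described in the abstract, not CogVideo's released code: each training clip is subsampled at a randomly chosen frame rate, and a special frame-rate token is prepended to the text tokens so the transformer is explicitly conditioned on how densely the clip was sampled. All names, constants, and the token layout here are assumptions made for illustration.

```python
import random
import torch

FRAME_RATES = [1, 2, 4, 8]   # candidate sampling rates in fps (assumed values)
RATE_TOKEN_BASE = 50_000     # assumed vocabulary offset reserved for rate tokens
CLIP_LEN = 5                 # frames kept per training clip (assumed)

def make_training_sequence(text_tokens: torch.Tensor,
                           frame_tokens: torch.Tensor,
                           base_fps: int = 24) -> torch.Tensor:
    """Build one sequence: [rate token] + text tokens + subsampled frame tokens.

    frame_tokens has shape (num_video_frames, tokens_per_frame): discrete
    image tokens for each original frame, e.g. from a VQ image tokenizer.
    """
    fps = random.choice(FRAME_RATES)
    stride = max(1, base_fps // fps)           # lower fps -> larger stride,
    clip = frame_tokens[::stride][:CLIP_LEN]   # so the clip spans more time
    rate_token = torch.tensor([RATE_TOKEN_BASE + FRAME_RATES.index(fps)])
    return torch.cat([rate_token, text_tokens, clip.flatten()])

# Example: a 16-token prompt and 40 raw frames of 256 tokens each.
seq = make_training_sequence(torch.randint(0, 20_000, (16,)),
                             torch.randint(0, 20_000, (40, 256)))
```

Conditioning on the rate token lets low-frame-rate clips cover longer actions during training; the hierarchical part of the strategy, which generates key frames first and then fills in intermediate frames, is omitted here for brevity.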
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer.
It can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [72.59262815400928]
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation.
We propose a novel text-to-video generation framework, termed TF-T2V, which can learn directly from text-free videos.
arXiv Detail & Related papers (2023-12-25T16:37:39Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably to, and sometimes better than, recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) that performs video question answering (VideoQA) in a contrastive manner.
The uniqueness and superiority of CoVGT are three-fold.
We show that CoVGT achieves much better performance than prior art on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
- Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT), which utilizes an object-centric approach to decompose scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn the complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and is thus able to train on videos up to 70 frames long on a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z)
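To make the object-as-token idea in the last entry concrete, here is a minimal, hypothetical sketch of the factorization OCVT describes: each frame is mapped to a fixed number of object slots, and each slot is quantized against a codebook, so a frame costs a handful of discrete tokens rather than thousands of pixel tokens. The encoder, sizes, and names below are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

K_SLOTS, SLOT_DIM, VOCAB = 8, 64, 1024   # assumed sizes, not from the paper

class ToyObjectTokenizer(nn.Module):
    """Stand-in for an object-centric encoder plus vector quantizer."""
    def __init__(self, frame_pixels: int = 3 * 64 * 64):
        super().__init__()
        self.encode = nn.Linear(frame_pixels, K_SLOTS * SLOT_DIM)  # toy slot encoder
        self.codebook = nn.Embedding(VOCAB, SLOT_DIM)              # discrete codes

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        slots = self.encode(frame.flatten()).view(K_SLOTS, SLOT_DIM)
        # nearest codebook entry per slot -> one discrete token per object
        return torch.cdist(slots, self.codebook.weight).argmin(dim=-1)

tokenizer = ToyObjectTokenizer()
video = torch.randn(4, 3, 64, 64)                    # four toy RGB frames
tokens = torch.stack([tokenizer(f) for f in video])  # shape (4, K_SLOTS)
```

A T-frame video then becomes a (T, K_SLOTS) grid of discrete tokens that an autoregressive transformer can model frame by frame; shrinking each frame to a few object tokens is the kind of saving that would make 70-frame training on a single 48GB GPU plausible.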
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.