A Simple Text to Video Model via Transformer
- URL: http://arxiv.org/abs/2309.14683v1
- Date: Tue, 26 Sep 2023 05:26:30 GMT
- Title: A Simple Text to Video Model via Transformer
- Authors: Gang Chen
- Abstract summary: We present a general and simple text-to-video model based on the Transformer.
Since both text and video are sequential data, we encode texts and images into the same hidden space.
We use GPT-2, test our approach on the UCF101 dataset, and show that it can generate promising videos.
- Score: 4.035107857147382
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a general and simple text-to-video model based on the Transformer.
Since both text and video are sequential data, we encode texts and images into the same
hidden space, feed them into a Transformer to capture temporal consistency, and then
decode to generate either text or images. Because the image signal may weaken over a long
sequence, we introduce a U-Net to reconstruct each image from its noised version.
Specifically, we increase the noise level applied to the original image along the long
sequence, use the $down$ module of the U-Net to encode the noised images, and feed these
encodings into the Transformer to predict the next clear images. We also add a constraint
that promotes motion between any pair of generated images in the video. We use GPT-2,
test our approach on the UCF101 dataset, and show that it can generate promising videos.
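The abstract describes the pipeline only at a high level. The sketch below is a minimal PyTorch illustration of that description, under assumed toy dimensions: the `frame_down` encoder, the per-position noise schedule, the frame decoder, and the `motion_constraint` margin are all hypothetical stand-ins, not the authors' code.
```python
# Hypothetical sketch of the described pipeline: text and image features share one
# hidden space, a GPT-2-style transformer models the sequence, a U-Net-like "down"
# module encodes progressively noised frames, and a motion constraint pushes
# generated frames apart. All module and parameter names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextToVideoSketch(nn.Module):
    def __init__(self, vocab_size=50257, hidden=768, n_layers=12, n_heads=12):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Stand-in for the U-Net "down" path: maps a (noised) 64x64 frame into the
        # shared hidden space used by the transformer.
        self.frame_down = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, hidden),
        )
        layer = nn.TransformerEncoderLayer(hidden, n_heads, 4 * hidden, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)  # GPT-2 stand-in (causal masking omitted)
        self.frame_decoder = nn.Linear(hidden, 3 * 64 * 64)        # toy frame decoder

    def forward(self, text_ids, frames):
        # frames: (B, T, 3, 64, 64); noise grows with position in the sequence,
        # mirroring "increase the noise level ... in the long sequence".
        B, T = frames.shape[:2]
        noise_scale = torch.linspace(0.0, 0.5, T, device=frames.device).view(1, T, 1, 1, 1)
        noised = frames + noise_scale * torch.randn_like(frames)
        frame_tokens = self.frame_down(noised.flatten(0, 1)).view(B, T, -1)
        seq = torch.cat([self.text_embed(text_ids), frame_tokens], dim=1)
        hidden = self.transformer(seq)
        # Predict a clear frame from each frame position.
        return self.frame_decoder(hidden[:, -T:]).view(B, T, 3, 64, 64)


def motion_constraint(preds, margin=0.1):
    """Encourage visible motion: penalize generated frame pairs that are too similar."""
    diffs = (preds[:, 1:] - preds[:, :-1]).abs().mean(dim=(2, 3, 4))
    return F.relu(margin - diffs).mean()
```
A training step would combine a frame-reconstruction loss on the predicted frames with `motion_constraint(preds)` so that consecutive generated frames are encouraged to differ.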
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with the text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [72.59262815400928]
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation.
We come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos.
arXiv Detail & Related papers (2023-12-25T16:37:39Z)
- SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models [84.71887272654865]
We present SparseCtrl to enable flexible structure control with temporally sparse signals.
It incorporates an additional condition to process these sparse signals while leaving the pre-trained T2V model untouched.
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images.
arXiv Detail & Related papers (2023-11-28T16:33:08Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
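As a rough illustration of the two-stream design summarized above (not the HAT implementation), the sketch below pairs an image Transformer with a text Transformer and aligns their pooled embeddings by cosine similarity; the hierarchical alignment module is omitted, and all dimensions and module names are assumed.
```python
# Illustrative two-stream retrieval sketch (not the HAT code): separate Transformer
# encoders for images and text, aligned by cosine similarity in a shared space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamRetrieval(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=768, hidden=512, n_layers=6, n_heads=8):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, 4 * hidden, batch_first=True), n_layers)
        self.patch_proj = nn.Linear(patch_dim, hidden)   # image patch features -> hidden
        self.word_embed = nn.Embedding(vocab_size, hidden)
        self.image_encoder, self.text_encoder = make_encoder(), make_encoder()

    def forward(self, patches, token_ids):
        img = self.image_encoder(self.patch_proj(patches)).mean(dim=1)   # (B, hidden)
        txt = self.text_encoder(self.word_embed(token_ids)).mean(dim=1)  # (B, hidden)
        return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)


# Retrieval score: cosine similarity between every image/text pair in a batch.
model = TwoStreamRetrieval()
img_emb, txt_emb = model(torch.randn(4, 49, 768), torch.randint(0, 30522, (4, 32)))
scores = img_emb @ txt_emb.t()   # (4, 4) similarity matrix
```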
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z)
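The temporal-shift idea itself can be illustrated by a parameter-free channel shift along the time axis, as in the generic sketch below; this is an assumption-laden stand-in rather than the Latent-Shift implementation, and `fold_div` is a hypothetical hyperparameter.
```python
# Generic temporal-shift sketch (not the Latent-Shift code): a parameter-free channel
# shift along the time axis that lets a spatial-only network exchange information
# between neighbouring frames of a latent tensor.
import torch


def temporal_shift(latents: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """latents: (B, T, C, H, W). Shift 1/fold_div of the channels forward in time,
    another 1/fold_div backward, and leave the rest untouched."""
    B, T, C, H, W = latents.shape
    fold = C // fold_div
    out = torch.zeros_like(latents)
    out[:, 1:, :fold] = latents[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = latents[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = latents[:, :, 2 * fold:]               # untouched channels
    return out


x = torch.randn(2, 16, 64, 32, 32)          # e.g. 16 latent frames
assert temporal_shift(x).shape == x.shape
```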
- Phenaki: Variable Length Video Generation From Open Domain Textual Description [21.610541668826006]
Phenaki is a model capable of realistic video synthesis given a sequence of textual prompts.
A new model for learning video representations compresses the video into a small representation of discrete tokens.
To the best of our knowledge, this is the first time a paper studies generating videos from time-variable prompts.
arXiv Detail & Related papers (2022-10-05T17:18:28Z)
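The "compress video into discrete tokens" idea can be sketched with a simple vector-quantization tokenizer, as below; this is a toy stand-in under assumed shapes and codebook size, not Phenaki's actual tokenizer.
```python
# Minimal vector-quantization sketch of the "video as discrete tokens" idea (not
# Phenaki's tokenizer): frames are encoded to continuous vectors, then snapped to
# their nearest entries in a learned codebook, yielding integer token ids.
import torch
import torch.nn as nn


class ToyVideoTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(              # per-frame encoder (stand-in)
            nn.Conv2d(3, 32, 4, stride=4), nn.GELU(),
            nn.Conv2d(32, latent_dim, 4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def tokenize(self, video: torch.Tensor) -> torch.Tensor:
        """video: (B, T, 3, H, W) -> discrete token ids of shape (B, T, h*w)."""
        B, T = video.shape[:2]
        z = self.encoder(video.flatten(0, 1))      # (B*T, D, h, w)
        z = z.flatten(2).transpose(1, 2)           # (B*T, h*w, D)
        # Nearest codebook entry per spatial position (Euclidean distance).
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1))
        ids = dists.argmin(dim=-1)                 # (B*T, h*w)
        return ids.view(B, T, -1)


tok = ToyVideoTokenizer()
ids = tok.tokenize(torch.randn(1, 8, 3, 64, 64))   # -> (1, 8, 16) token ids
```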
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [17.861540412002967]
We propose a self-supervised scheme named CLIP-GEN for general text-to-image generation.
In our approach, we only require a set of unlabeled images in the general domain to train a text-to-image generator.
Our method significantly outperforms optimization-based text-to-image methods in terms of image quality.
arXiv Detail & Related papers (2022-03-01T12:11:32Z)
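One common way to read "language-free" training with CLIP is to condition the generator on CLIP image embeddings during training and swap in CLIP text embeddings at inference, relying on CLIP's shared image-text space. The sketch below illustrates only that conditioning swap, under assumptions: the encoders, the toy reconstruction objective, and all shapes are stand-ins, not the CLIP-GEN architecture or loss.
```python
# Hedged sketch of language-free conditioning (not the CLIP-GEN code): train on
# CLIP *image* embeddings of unlabeled images, generate from CLIP *text* embeddings.
import torch
import torch.nn as nn


class StandInClip(nn.Module):
    """Placeholder for a real CLIP model; both encoders map into the same space."""
    def __init__(self, dim=512):
        super().__init__()
        self.img_proj = nn.Linear(3 * 224 * 224, dim)
        self.txt_proj = nn.Embedding(49408, dim)

    def encode_image(self, images):           # (B, 3, 224, 224) -> (B, dim)
        return self.img_proj(images.flatten(1))

    def encode_text(self, token_ids):         # (B, L) -> (B, dim)
        return self.txt_proj(token_ids).mean(dim=1)


class ConditionalGenerator(nn.Module):
    """Toy generator: maps a conditioning embedding to a small image."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 1024), nn.GELU(),
                                 nn.Linear(1024, 3 * 64 * 64))

    def forward(self, cond):
        return self.net(cond).view(-1, 3, 64, 64)


clip_model, gen = StandInClip(), ConditionalGenerator()

# Training step: reconstruct an unlabeled image from its own CLIP image embedding.
images = torch.randn(4, 3, 224, 224)
recon = gen(clip_model.encode_image(images))
loss = nn.functional.mse_loss(recon, nn.functional.interpolate(images, size=64))

# Inference: condition on a text embedding instead (token ids are placeholders).
sample = gen(clip_model.encode_text(torch.randint(0, 49408, (1, 16))))
```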
- Bornon: Bengali Image Captioning with Transformer-based Deep learning approach [0.0]
The Transformer model is used to generate captions from images using English datasets.
We used three different Bengali datasets to generate Bengali captions from images using the Transformer model.
We compared the result of the transformer-based model with other models that employed different Bengali image captioning datasets.
arXiv Detail & Related papers (2021-09-11T08:29:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.