Video Generation from Text Employing Latent Path Construction for
Temporal Modeling
- URL: http://arxiv.org/abs/2107.13766v1
- Date: Thu, 29 Jul 2021 06:28:20 GMT
- Title: Video Generation from Text Employing Latent Path Construction for
Temporal Modeling
- Authors: Amir Mazaheri, Mubarak Shah
- Abstract summary: Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
- Score: 70.06508219998778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video generation is one of the most challenging tasks in Machine Learning and
Computer Vision fields of study. In this paper, we tackle the text to video
generation problem, which is a conditional form of video generation. Humans can
listen to or read natural language sentences and can imagine or visualize what is
being described; therefore, we believe that video generation from natural
language sentences will have an important impact on Artificial Intelligence.
Video generation is a relatively new field of study in Computer Vision, which
is far from being solved. The majority of recent works deal with synthetic
datasets or real datasets with very limited types of objects, scenes, and
motions. To the best of our knowledge, this is the first work on text
(free-form sentences) to video generation for more realistic video datasets like
Actor and Action Dataset (A2D) or UCF101. We tackle the complicated problem of
video generation by regressing the latent representations of the first and last
frames and employing a context-aware interpolation method to build the latent
representations of in-between frames. We propose a stacking "upPooling" block
to sequentially generate RGB frames out of each latent representation and
progressively increase the resolution. Moreover, our proposed Discriminator
encodes videos based on single and multiple frames. We provide quantitative and
qualitative results to support our arguments and show the superiority of our
method over well-known baselines like Recurrent Neural Network (RNN) and
Deconvolution (also known as Convolutional Transpose) based video generation
methods.
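To make the construction concrete, below is a minimal sketch (not the authors' released code) of the latent-path idea: regress latents for the first and last frames, interpolate the in-between latents, and decode each latent into an RGB frame with a stack of upsampling blocks. All module names, sizes, and the plain linear interpolation are illustrative assumptions; the paper uses a learned, context-aware interpolation and its proposed "upPooling" decoder.

```python
# Illustrative sketch only: hypothetical names and shapes, linear interpolation
# stands in for the paper's context-aware latent interpolation.
import torch
import torch.nn as nn


class UpBlock(nn.Module):
    """Doubles spatial resolution: nearest-neighbour upsample followed by a conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class FrameDecoder(nn.Module):
    """Maps one latent vector to one RGB frame by progressively increasing resolution."""
    def __init__(self, latent_dim=256, base_ch=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, base_ch * 4 * 4)   # start from a 4x4 feature map
        self.blocks = nn.Sequential(                        # 4 -> 8 -> 16 -> 32 -> 64
            UpBlock(base_ch, base_ch // 2),
            UpBlock(base_ch // 2, base_ch // 4),
            UpBlock(base_ch // 4, base_ch // 8),
            UpBlock(base_ch // 8, base_ch // 16),
        )
        self.to_rgb = nn.Conv2d(base_ch // 16, 3, kernel_size=3, padding=1)

    def forward(self, z):
        x = self.fc(z).view(z.size(0), -1, 4, 4)
        return torch.tanh(self.to_rgb(self.blocks(x)))


def latent_path(z_first, z_last, num_frames):
    """Build in-between frame latents from the first- and last-frame latents.
    Simple linear interpolation here; the paper learns a context-aware path."""
    alphas = torch.linspace(0.0, 1.0, num_frames, device=z_first.device)
    return torch.stack([(1 - a) * z_first + a * z_last for a in alphas], dim=1)


if __name__ == "__main__":
    decoder = FrameDecoder()
    # In the paper these two latents are regressed from the input sentence;
    # random vectors are used here purely for shape checking.
    z0, zT = torch.randn(2, 256), torch.randn(2, 256)
    zs = latent_path(z0, zT, num_frames=8)                       # (batch, T, latent_dim)
    video = torch.stack([decoder(zs[:, t]) for t in range(8)], dim=1)
    print(video.shape)                                            # (2, 8, 3, 64, 64)
```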
Related papers
- Grid Diffusion Models for Text-to-Video Generation [2.531998650341267]
Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation.
We propose a simple but effective novel grid diffusion for text-to-video generation without temporal dimension in architecture and a large text-video paired dataset.
Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2024-03-30T03:50:43Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - UniVG: Towards UNIfied-modal Video Generation [27.07637246141562]
We propose a Unified-modal Video Generation system capable of handling multiple video generation tasks across text and image modalities.
Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2.
arXiv Detail & Related papers (2024-01-17T09:46:13Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - LIFI: Towards Linguistically Informed Frame Interpolation [66.05105400951567]
We try to solve this problem by using several deep learning video generation algorithms to generate missing frames.
We release several datasets to test computer vision video generation models on their speech understanding.
arXiv Detail & Related papers (2020-10-30T05:02:23Z) - TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary
Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
The step-by-step learning process helps stabilize training and enables the creation of high-resolution videos conditioned on text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.