TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary
Generator
- URL: http://arxiv.org/abs/2009.02018v2
- Date: Mon, 28 Jun 2021 00:25:23 GMT
- Title: TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary
Generator
- Authors: Doyeon Kim, Donggyu Joo, Junmo Kim
- Abstract summary: We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
Step-by-step learning process helps stabilize the training and enables the creation of high-resolution video based on conditional text descriptions.
- Score: 34.7504057664375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in technology have led to the development of methods that can create
desired visual multimedia. In particular, image generation using deep learning
has been extensively studied across diverse fields. In comparison, video
generation, especially on conditional inputs, remains a challenging and less
explored area. To narrow this gap, we aim to train our model to produce a video
corresponding to a given text description. We propose a novel training
framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN),
which evolves frame-by-frame and finally produces a full-length video. In the
first phase, we focus on creating a high-quality single video frame while
learning the relationship between the text and an image. As the steps proceed,
our model is gradually trained on an increasing number of consecutive frames. This
step-by-step learning process helps stabilize the training and enables the
creation of high-resolution video based on conditional text descriptions.
Qualitative and quantitative experimental results on various datasets
demonstrate the effectiveness of the proposed method.
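The frame-by-frame evolution described in the abstract can be illustrated with a short sketch. The following PyTorch-style code is a toy, hedged reconstruction of that schedule only: the module definitions, the frame-doubling rule, and the placeholder data are assumptions made for illustration, not the paper's architecture. In the paper's terms, the first phase corresponds to text-to-image training on a single frame, and each later phase extends the clip length until full-length videos are produced.

```python
# A minimal sketch of the step-by-step (evolutionary) training schedule
# described above. This is NOT the authors' implementation: the toy modules,
# the frame-doubling schedule, and the placeholder "real" clips are all
# illustrative assumptions.
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    """Toy text-conditioned generator: (text embedding, noise) -> T frames."""
    def __init__(self, text_dim=128, noise_dim=64, frame_pixels=3 * 64 * 64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_pixels), nn.Tanh(),
        )

    def forward(self, text_emb, num_frames):
        # One frame per noise sample; a real model would also share temporal
        # state across frames instead of sampling them independently.
        frames = [self.net(torch.cat(
            [text_emb, torch.randn(text_emb.size(0), self.noise_dim)], dim=1))
            for _ in range(num_frames)]
        return torch.stack(frames, dim=1)          # (batch, T, frame_pixels)

class ClipDiscriminator(nn.Module):
    """Toy critic scoring a clip of T frames against its text embedding."""
    def __init__(self, text_dim=128, frame_pixels=3 * 64 * 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(text_dim + frame_pixels, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, clip, text_emb):
        # Average per-frame scores as a stand-in for a real video critic.
        per_frame = [self.score(torch.cat([f, text_emb], dim=1))
                     for f in clip.unbind(dim=1)]
        return torch.stack(per_frame).mean(dim=0)  # (batch, 1)

def evolutionary_training(text_embs, num_phases=3, steps_per_phase=5):
    """Phase 1 learns a single text-conditioned frame; every later phase
    doubles the number of consecutive frames (an assumed schedule)."""
    gen, disc = FrameGenerator(), ClipDiscriminator()
    g_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(text_embs.size(0), 1)
    zeros = torch.zeros(text_embs.size(0), 1)

    for phase in range(num_phases):
        num_frames = 2 ** phase                    # 1, 2, 4, ... frames
        for _ in range(steps_per_phase):
            fake = gen(text_embs, num_frames)
            real = torch.rand_like(fake) * 2 - 1   # placeholder "real" clips

            d_loss = bce(disc(real, text_embs), ones) + \
                     bce(disc(fake.detach(), text_embs), zeros)
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()

            g_loss = bce(disc(fake, text_embs), ones)
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()

if __name__ == "__main__":
    evolutionary_training(torch.randn(4, 128))
```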
Related papers
- I4VGen: Image as Free Stepping Stone for Text-to-Video Generation [28.910648256877113]
We present I4VGen, a novel video diffusion inference pipeline to enhance pre-trained text-to-video diffusion models.
I4VGen consists of two stages: anchor image synthesis and anchor image-augmented text-to-video synthesis.
Experiments show that the proposed method produces videos with higher visual realism and textual fidelity; a model-agnostic sketch of this anchor-image pipeline appears after this list.
arXiv Detail & Related papers (2024-06-04T11:48:44Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to learn a specific motion pattern with only 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z) - VideoGen: A Reference-Guided Latent Diffusion Approach for High
Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
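Several of the related works above (I4VGen, VideoGen, LAMP) share the idea of first synthesizing a single anchor or first frame from the text and then conditioning video synthesis on it. The sketch below is a hypothetical, model-agnostic outline of that two-stage pipeline; every callable passed in is a placeholder, not an API from any of the cited papers.

```python
from typing import Any, Callable, List, Optional

def text_to_video_via_anchor(
    prompt: str,
    synthesize_anchor_image: Callable[[str], Any],
    video_model_from_anchor: Callable[[str, Any], List[Any]],
    score_image: Optional[Callable[[str, Any], float]] = None,
    num_candidates: int = 4,
) -> List[Any]:
    """Two-stage text-to-video: (1) anchor image synthesis, (2) anchor-
    conditioned video synthesis. All callables are hypothetical stand-ins
    for whatever text-to-image / image-to-video models are actually used."""
    # Stage 1: sample several candidate anchor images and, if a scorer is
    # provided (e.g. a text-image similarity model), keep the best match.
    candidates = [synthesize_anchor_image(prompt) for _ in range(num_candidates)]
    anchor = (max(candidates, key=lambda img: score_image(prompt, img))
              if score_image else candidates[0])

    # Stage 2: generate the video conditioned on both the prompt and the
    # chosen anchor frame, which supplies the content/appearance prior.
    return video_model_from_anchor(prompt, anchor)
```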