UniVG: Towards UNIfied-modal Video Generation
- URL: http://arxiv.org/abs/2401.09084v1
- Date: Wed, 17 Jan 2024 09:46:13 GMT
- Title: UniVG: Towards UNIfied-modal Video Generation
- Authors: Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao
- Abstract summary: We propose a Unified-modal Video Generation system capable of handling multiple video generation tasks across text and image modalities.
Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2.
- Score: 27.07637246141562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based video generation has received extensive attention and
achieved considerable success within both the academic and industrial
communities. However, current efforts are mainly concentrated on
single-objective or single-task video generation, such as generation driven by
text, by image, or by a combination of text and image. This cannot fully meet
the needs of real-world application scenarios, as users are likely to input
images and text conditions in a flexible manner, either individually or in
combination. To address this, we propose a Unified-modal Video Generation
system that is capable of handling multiple video generation tasks across text
and image modalities. To this end, we revisit the various video generation
tasks within our system from the perspective of generative freedom, and
classify them into high-freedom and low-freedom video generation categories.
For high-freedom video generation, we employ Multi-condition Cross Attention to
generate videos that align with the semantics of the input images or text. For
low-freedom video generation, we introduce Biased Gaussian Noise to replace the
pure random Gaussian Noise, which helps to better preserve the content of the
input conditions. Our method achieves the lowest Fréchet Video Distance (FVD)
on the public academic benchmark MSR-VTT, surpasses the current open-source
methods in human evaluations, and is on par with the current closed-source
method Gen2. For more samples, visit https://univg-baidu.github.io.
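The abstract names Multi-condition Cross Attention but does not spell out its form. A common way to condition a video diffusion U-Net on text and image together is to let the video latent tokens attend over the concatenation of both condition token sequences in a single cross-attention block, so that either condition can also be supplied alone. The PyTorch sketch below illustrates that reading; the class name, tensor shapes, and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: video latent tokens cross-attend over the concatenation
# of text and image condition tokens. Names/shapes are assumptions, not UniVG's.
import torch
import torch.nn as nn


class MultiConditionCrossAttention(nn.Module):
    def __init__(self, latent_dim=320, cond_dim=768, num_heads=8):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = nn.Linear(cond_dim, latent_dim, bias=False)
        self.to_v = nn.Linear(cond_dim, latent_dim, bias=False)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.to_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, video_tokens, text_tokens=None, image_tokens=None):
        # Use whichever conditions are supplied: text, image, or both.
        conds = [c for c in (text_tokens, image_tokens) if c is not None]
        context = torch.cat(conds, dim=1)              # (B, N_txt + N_img, cond_dim)
        q = self.to_q(video_tokens)                    # (B, N_video, latent_dim)
        k, v = self.to_k(context), self.to_v(context)
        out, _ = self.attn(q, k, v, need_weights=False)
        return video_tokens + self.to_out(out)         # residual connection


# Dummy usage: 2 clips x 1024 latent tokens, 77 text tokens, 257 image tokens.
layer = MultiConditionCrossAttention()
z = torch.randn(2, 1024, 320)
txt, img = torch.randn(2, 77, 768), torch.randn(2, 257, 768)
out = layer(z, text_tokens=txt, image_tokens=img)      # -> (2, 1024, 320)
```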
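Likewise, Biased Gaussian Noise is only named in the abstract. One plausible reading for low-freedom tasks is to start sampling not from pure N(0, I) noise but from noise whose mean is pulled toward the encoded input (e.g., the image latent), as in a late forward-diffusion step, so the denoiser preserves the conditioning content. The mixing rule, function name, and shapes below are assumptions for illustration only.

```python
# Hypothetical sketch of a biased starting latent for low-freedom generation.
# The blend rule mimics a forward diffusion step at a late timestep; it is an
# assumption, not the formula used in the UniVG paper.
import torch


def biased_gaussian_noise(cond_latent: torch.Tensor, alpha_bar_T: float = 0.01) -> torch.Tensor:
    """Mix the condition latent with Gaussian noise so some content survives."""
    noise = torch.randn_like(cond_latent)
    return (alpha_bar_T ** 0.5) * cond_latent + ((1.0 - alpha_bar_T) ** 0.5) * noise


# Usage: bias the sampler's start toward an encoded input image, broadcast over frames.
image_latent = torch.randn(1, 4, 1, 64, 64)        # (B, C, 1 frame, H, W), e.g. a VAE latent
x_T = biased_gaussian_noise(image_latent.expand(1, 4, 16, 64, 64))  # 16-frame start point
# x_T then replaces the usual torch.randn(...) as the initial latent for the sampler.
```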
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame retention branch on top of a pre-trained video diffusion model.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 compared to other image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z) - VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z) - Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising [43.35391175319815]
This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos.
We introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models.
Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models.
arXiv Detail & Related papers (2023-05-29T17:38:18Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z) - Non-Adversarial Video Synthesis with Learned Priors [53.26777815740381]
We focus on the problem of generating videos from latent noise vectors, without any reference input frames.
We develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network, and a generator through non-adversarial learning.
Our approach generates superior quality videos compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2020-03-21T02:57:33Z)