FusionFrames: Efficient Architectural Aspects for Text-to-Video
Generation Pipeline
- URL: http://arxiv.org/abs/2311.13073v2
- Date: Wed, 20 Dec 2023 15:58:26 GMT
- Title: FusionFrames: Efficient Architectural Aspects for Text-to-Video
Generation Pipeline
- Authors: Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta
Dakhova, Andrey Kuznetsov, Denis Dimitrov
- Abstract summary: This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores.
- Score: 4.295130967329365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimedia generation approaches occupy a prominent place in artificial
intelligence research. Text-to-image models achieved high-quality results over
the last few years. However, video synthesis methods recently started to
develop. This paper presents a new two-stage latent diffusion text-to-video
generation architecture based on the text-to-image diffusion model. The first
stage concerns keyframes synthesis to figure the storyline of a video, while
the second one is devoted to interpolation frames generation to make movements
of the scene and objects smooth. We compare several temporal conditioning
approaches for keyframes generation. The results show the advantage of using
separate temporal blocks over temporal layers in terms of metrics reflecting
video generation quality aspects and human preference. The design of our
interpolation model significantly reduces computational costs compared to other
masked frame interpolation approaches. Furthermore, we evaluate different
configurations of MoVQ-based video decoding scheme to improve consistency and
achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our
pipeline with existing solutions and achieve top-2 scores overall and top-1
among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:
https://ai-forever.github.io/kandinsky-video/
Related papers
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z) - Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse" dubbed $textitVidRD$ to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs)
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - WALDO: Future Video Synthesis using Object Layer Decomposition and
Parametric Flow Prediction [82.79642869586587]
WALDO is a novel approach to the prediction of future video frames from past ones.
Individual images are decomposed into multiple layers combining object masks and a small set of control points.
The layer structure is shared across all frames in each video to build dense inter-frame connections.
arXiv Detail & Related papers (2022-11-25T18:59:46Z) - FILM: Frame Interpolation for Large Motion [20.04001872133824]
We present a frame algorithm that synthesizes multiple intermediate frames from two input images with large in-between motion.
Our approach outperforms state-of-the-art methods on the Xiph large motion benchmark.
arXiv Detail & Related papers (2022-02-10T08:48:18Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z) - W-Cell-Net: Multi-frame Interpolation of Cellular Microscopy Videos [1.7205106391379026]
We apply recent advances in Deep video convolution to increase the temporal resolution of fluorescent microscopy time-lapse movies.
To our knowledge, there is no previous work that uses Conal Neural Networks (CNN) to generate frames between two consecutive microscopy images.
arXiv Detail & Related papers (2020-05-14T01:33:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.