FlashVideo: A Framework for Swift Inference in Text-to-Video Generation
- URL: http://arxiv.org/abs/2401.00869v1
- Date: Sat, 30 Dec 2023 00:06:28 GMT
- Title: FlashVideo: A Framework for Swift Inference in Text-to-Video Generation
- Authors: Bin Lei, Le Chen, Caiwen Ding
- Abstract summary: This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation.
FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed.
Our comprehensive experiments demonstrate that FlashVideo achieves a $\times 9.17$ efficiency improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models.
- Score: 9.665089218030086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the evolving field of machine learning, video generation has witnessed
significant advancements with autoregressive-based transformer models and
diffusion models, known for synthesizing dynamic and realistic scenes. However,
these models often face challenges with prolonged inference times, even for
generating short video clips such as GIFs. This paper introduces FlashVideo, a
novel framework tailored for swift Text-to-Video generation. FlashVideo
represents the first successful adaptation of the RetNet architecture for video
generation, bringing a unique approach to the field. Leveraging the
RetNet-based architecture, FlashVideo reduces the time complexity of inference
from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$,
significantly accelerating inference speed. Additionally, we adopt a
redundant-free frame interpolation method, enhancing the efficiency of frame
interpolation. Our comprehensive experiments demonstrate that FlashVideo
achieves a $\times9.17$ efficiency improvement over a traditional
autoregressive-based transformer model, and its inference speed is of the same
order of magnitude as that of BERT-based transformer models.
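The linear-time claim follows from RetNet's recurrent inference form, in which a fixed-size retention state replaces attention's key-value cache that grows with the sequence. The sketch below is illustrative only, not FlashVideo's code: the decay factor, dimensions, and function names are assumptions chosen to contrast the per-token cost of causal attention with a retention-style recurrence.

```python
# Illustrative sketch (not from the paper): per-token cost of causal attention
# vs. a RetNet-style recurrent retention update. All names/shapes are assumed.
import numpy as np

d = 64           # head dimension (illustrative)
L = 1024         # sequence length
gamma = 0.98     # retention decay factor (assumed hyperparameter)

rng = np.random.default_rng(0)
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# --- Autoregressive attention: token t attends to all t+1 previous tokens,
#     so total work over the sequence grows as O(L^2). ---
def attention_step(t):
    scores = Q[t] @ K[: t + 1].T / np.sqrt(d)       # O(t * d) work at step t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[: t + 1]

# --- Recurrent retention: a fixed-size state S (d x d) summarizes the past,
#     so each token costs O(d^2) regardless of position, i.e. O(L) overall. ---
def retention_decode():
    S = np.zeros((d, d))
    outputs = []
    for t in range(L):
        S = gamma * S + np.outer(K[t], V[t])        # constant-time state update
        outputs.append(Q[t] @ S)                    # constant-time readout
    return np.stack(outputs)

_ = attention_step(L - 1)        # cost of the last attention step scales with L
outs = retention_decode()
print(outs.shape)                # (1024, 64)
```

Because the retention state stays d x d no matter how long the sequence is, each generated token costs O(d^2) and the whole sequence costs O(L), whereas the attention step's cost grows with position and sums to O(L^2).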
Related papers
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
- TVG: A Training-free Transition Video Generation Method with Diffusion Models [12.037716102326993]
Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives.
Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes.
We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training.
arXiv Detail & Related papers (2024-08-24T00:33:14Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
We instead propose a transformer-based architecture that trains and runs inference substantially faster than U-Nets, allowing us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Video Transformer Network [0.0]
This paper presents a transformer-based framework for video recognition.
Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets.
Our approach is generic and builds on top of any given 2D spatial network.
arXiv Detail & Related papers (2021-02-01T09:29:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.