F3-Pruning: A Training-Free and Generalized Pruning Strategy towards
Faster and Finer Text-to-Video Synthesis
- URL: http://arxiv.org/abs/2312.03459v1
- Date: Wed, 6 Dec 2023 12:34:47 GMT
- Title: F3-Pruning: A Training-Free and Generalized Pruning Strategy towards
Faster and Finer Text-to-Video Synthesis
- Authors: Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
- Abstract summary: We explore the inference process of two mainstream T2V models using transformers and diffusion models.
We propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.
Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning.
- Score: 94.10861578387443
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently Text-to-Video (T2V) synthesis has undergone a breakthrough by
training transformers or diffusion models on large-scale datasets.
Nevertheless, inferring such large models incurs huge costs. Previous inference
acceleration works either require costly retraining or are model-specific. To
address this issue, instead of retraining we explore the inference process of
two mainstream T2V models using transformers and diffusion models. The
exploration reveals the redundancy in temporal attention modules of both
models, which are commonly utilized to establish temporal relations among
frames. Consequently, we propose a training-free and generalized pruning
strategy called F3-Pruning to prune redundant temporal attention
weights. Specifically, when aggregate temporal attention values are ranked below
a certain ratio, corresponding weights will be pruned. Extensive experiments on
three datasets using a classic transformer-based model CogVideo and a typical
diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in
inference acceleration, quality assurance and broad applicability.
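The ranking rule in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the unit of pruning (named blocks here), the use of a plain sum as the aggregate, and the names `aggregate_scores`, `select_pruned`, and `recorded` are all assumptions made for illustration. The sketch only shows the idea of ranking aggregate temporal attention values and marking the bottom fraction for pruning.

```python
from math import floor
from typing import Dict, List, Set


def aggregate_scores(attn_values: Dict[str, List[float]]) -> Dict[str, float]:
    """Sum the recorded temporal attention values of each unit (assumed aggregate)."""
    return {name: sum(vals) for name, vals in attn_values.items()}


def select_pruned(scores: Dict[str, float], ratio: float) -> Set[str]:
    """Return the units whose aggregate score ranks in the bottom `ratio` fraction."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    # Sort ascending by score; break ties by name so the result is deterministic.
    ranked = sorted(scores, key=lambda name: (scores[name], name))
    n_prune = floor(len(ranked) * ratio)
    return set(ranked[:n_prune])


# Hypothetical attention values recorded for four temporal attention blocks.
recorded = {
    "block_0": [0.9, 0.8],
    "block_1": [0.1, 0.2],
    "block_2": [0.5, 0.4],
    "block_3": [0.05, 0.05],
}
pruned = select_pruned(aggregate_scores(recorded), ratio=0.5)
print(sorted(pruned))  # ['block_1', 'block_3']
```

With a ratio of 0.5, the two blocks with the lowest aggregate values are selected. In a real model, the selected temporal attention weights would then be skipped at inference time, with no retraining involved.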
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane
Networks [63.84589410872608]
We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies.
Our approach reduces computational complexity by a factor of $2$ as measured in FLOPs.
Our model is capable of synthesizing high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- Generative Time Series Forecasting with Diffusion, Denoise, and
Disentanglement [51.55157852647306]
Time series forecasting has been a widely explored task of great importance in many applications.
It is common that real-world time series data are recorded in a short time period, which results in a big gap between the deep model and the limited and noisy time series.
We propose to address the time series forecasting problem with generative modeling and propose a bidirectional variational auto-encoder equipped with diffusion, denoise, and disentanglement.
arXiv Detail & Related papers (2023-01-08T12:20:46Z)
- Towards Long-Term Time-Series Forecasting: Feature, Pattern, and
Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity owing to their self-attention mechanism, which is computationally expensive.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
- Imaging through the Atmosphere using Turbulence Mitigation Transformer [15.56320865332645]
Restoring images distorted by atmospheric turbulence is a ubiquitous problem in long-range imaging applications.
Existing deep-learning-based methods have demonstrated promising results in specific testing conditions.
We introduce the turbulence mitigation transformer (TMT) that explicitly addresses these issues.
arXiv Detail & Related papers (2022-07-13T18:33:26Z)
- Temporal Transformer Networks with Self-Supervision for Action
Recognition [13.00827959393591]
We introduce a startling Temporal Transformer Network with Self-supervision (TTSN).
TTSN consists of a temporal transformer module and a temporal sequence self-supervision module.
Our proposed TTSN is promising as it successfully achieves state-of-the-art performance for action recognition.
arXiv Detail & Related papers (2021-12-14T12:53:53Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- A Log-likelihood Regularized KL Divergence for Video Prediction with A
3D Convolutional Variational Recurrent Network [17.91970304953206]
We introduce a new variational model that extends the recurrent network in two ways for the task of frame prediction.
First, we introduce 3D convolutions inside all modules, including the recurrent model for future frame prediction, inputting and outputting a sequence of video frames at each timestep.
Second, we enhance the latent loss predictions of the variational model by introducing a maximum likelihood estimate in addition to the KL divergence that is commonly used in variational models.
arXiv Detail & Related papers (2020-12-11T05:05:31Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.