F3-Pruning: A Training-Free and Generalized Pruning Strategy towards
Faster and Finer Text-to-Video Synthesis
- URL: http://arxiv.org/abs/2312.03459v1
- Date: Wed, 6 Dec 2023 12:34:47 GMT
- Title: F3-Pruning: A Training-Free and Generalized Pruning Strategy towards
Faster and Finer Text-to-Video Synthesis
- Authors: Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
- Abstract summary: We explore the inference process of two mainstream T2V models using transformers and diffusion models.
We propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.
Extensive experiments on three datasets using a classic transformer-based model, CogVideo, and a typical diffusion-based model, Tune-A-Video, verify the effectiveness of F3-Pruning.
- Score: 94.10861578387443
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough by
training transformers or diffusion models on large-scale datasets.
Nevertheless, inference with such large models incurs huge costs. Previous
inference acceleration works either require costly retraining or are
model-specific. To address this issue, instead of retraining we explore the
inference process of two mainstream T2V models built on transformers and
diffusion models. The exploration reveals redundancy in the temporal attention
modules of both models, which are commonly used to establish temporal
relations among frames. Consequently, we propose a training-free and
generalized pruning strategy called F3-Pruning to prune redundant temporal
attention weights. Specifically, when aggregate temporal attention values rank
below a certain ratio, the corresponding weights are pruned. Extensive
experiments on three datasets using a classic transformer-based model,
CogVideo, and a typical diffusion-based model, Tune-A-Video, verify the
effectiveness of F3-Pruning in inference acceleration, quality assurance and
broad applicability.
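As a rough illustration of the pruning criterion stated in the abstract, the following minimal sketch (not the authors' released code) ranks temporal attention weights by their aggregate attention values and zeroes out those falling below a chosen keep ratio; the tensor layout and the `keep_ratio` parameter are assumptions made for this example.

```python
# Minimal sketch of the F3-Pruning criterion described in the abstract
# (not the authors' implementation). Assumes a temporal attention map of
# shape (heads, frames, frames); keep_ratio is a hypothetical hyperparameter.
import torch


def f3_prune_temporal_attention(attn: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """Zero out temporal attention weights whose aggregate value ranks
    below the given keep ratio."""
    # Aggregate attention over the query dimension to obtain a
    # per-(head, key-frame) importance score.
    scores = attn.sum(dim=1)                       # (heads, frames)
    flat = scores.flatten()
    k = max(1, int(keep_ratio * flat.numel()))     # number of entries kept
    threshold = torch.topk(flat, k).values.min()   # smallest retained score
    # Keep only key frames whose aggregate score reaches the threshold.
    mask = (scores >= threshold).unsqueeze(1).expand_as(attn)
    return attn * mask


# Example usage with random attention weights for a 16-frame clip:
# attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
# pruned = f3_prune_temporal_attention(attn, keep_ratio=0.7)
```

The sketch only conveys the ranking-and-thresholding step; according to the abstract, the paper applies this idea to the temporal attention weights of CogVideo and Tune-A-Video without any retraining.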
Related papers
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Towards Long-Term Time-Series Forecasting: Feature, Pattern, and
Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity thanks to the self-attention mechanism, despite its high computational cost.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z) - Imaging through the Atmosphere using Turbulence Mitigation Transformer [15.56320865332645]
Restoring images distorted by atmospheric turbulence is a ubiquitous problem in long-range imaging applications.
Existing deep-learning-based methods have demonstrated promising results in specific testing conditions.
We introduce the turbulence mitigation transformer (TMT) that explicitly addresses these issues.
arXiv Detail & Related papers (2022-07-13T18:33:26Z) - Temporal Transformer Networks with Self-Supervision for Action
Recognition [13.00827959393591]
We introduce a novel Temporal Transformer Network with Self-supervision (TTSN).
TTSN consists of a temporal transformer module and a temporal sequence self-supervision module.
Our proposed TTSN achieves state-of-the-art performance for action recognition.
arXiv Detail & Related papers (2021-12-14T12:53:53Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)