Faster Diffusion via Temporal Attention Decomposition
- URL: http://arxiv.org/abs/2404.02747v2
- Date: Wed, 17 Jul 2024 23:09:10 GMT
- Title: Faster Diffusion via Temporal Attention Decomposition
- Authors: Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber
- Abstract summary: We explore the role of the attention mechanism during inference in text-conditional diffusion models.
We develop a training-free method known as temporally gating the attention (TGATE).
TGATE efficiently generates images by caching and reusing attention outputs at scheduled time steps.
- Score: 77.90640748930178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. By contrast, self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when applied to various existing text-conditional diffusion models, TGATE accelerates them by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
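The gating idea described above is simple enough to sketch in code. The following is a minimal, hypothetical PyTorch illustration of caching and reusing a cross-attention output at a scheduled step; the class name, the `gate_step` parameter, and the single-head simplification are assumptions made for exposition, not the official TGATE implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

class GatedCrossAttention(torch.nn.Module):
    """Hypothetical sketch of TGATE-style temporal gating: the
    cross-attention output is computed normally during the early
    "planning" phase, cached at a scheduled gate step, and reused
    for all remaining "fidelity-improving" steps."""

    def __init__(self, dim: int, gate_step: int = 10):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        self.gate_step = gate_step  # step at which the output is frozen
        self.cache = None           # cached cross-attention output

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor,
                step: int) -> torch.Tensor:
        # After the gate step, skip the computation entirely and
        # reuse the (empirically converged) cached output.
        if step > self.gate_step and self.cache is not None:
            return self.cache

        q = self.to_q(x)         # image tokens: (batch, n_img, dim)
        k = self.to_k(text_emb)  # text tokens:  (batch, n_txt, dim)
        v = self.to_v(text_emb)
        out = F.scaled_dot_product_attention(q, k, v)

        if step == self.gate_step:
            self.cache = out.detach()  # freeze at the scheduled step
        return out
```

In a real sampler, a module like this would stand in for each gated cross-attention block of the denoising network, with `step` supplied by the scheduler loop, so every step after `gate_step` saves the cost of the gated attention computation (a full implementation would also reset the cache between generations).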
Related papers
- Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model [57.24046436423511]
Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation.
We explore the mechanism behind DPM by examining the intermediate statuses during the gradual denoising generation process.
We propose to exploit this observation to accelerate T2I generation by properly removing text guidance.
arXiv Detail & Related papers (2024-05-24T08:12:41Z)
- Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask [6.579109660479191]
Time Series Diffusion Embedding (TSDE) is the first diffusion-based self-supervised learning (SSL) approach to time series representation learning (TSRL).
It segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask.
It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism.
arXiv Detail & Related papers (2024-05-09T17:55:16Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- Improving Misaligned Multi-modality Image Fusion with One-stage Progressive Dense Registration [67.23451452670282]
Misalignments between multi-modality images pose challenges in image fusion.
We propose a Cross-modality Multi-scale Progressive Dense Registration scheme.
This scheme accomplishes the coarse-to-fine registration exclusively using a one-stage optimization.
arXiv Detail & Related papers (2023-08-22T03:46:24Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery [20.787180028571694]
DiffusionSeg is a synthesis-exploitation framework containing two-stage strategies.
In the first synthesis stage, we generate abundant images and propose a novel training-free AttentionCut to obtain masks.
In the second exploitation stage, we use an inversion technique to map the given image back to diffusion features, bridging the structural gap.
arXiv Detail & Related papers (2023-03-17T07:47:55Z)
- Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance [148.9832328803202]
We model the information fusion within the focal stack via graph networks.
We build a novel dual graph model to guide the focal stack fusion process using all-focus patterns.
arXiv Detail & Related papers (2021-10-02T00:54:39Z)
- TTAN: Two-Stage Temporal Alignment Network for Few-shot Action Recognition [29.95184808021684]
Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support).
We devise a novel multi-shot fusion strategy, which takes the misalignment among support samples into consideration.
Experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition.
arXiv Detail & Related papers (2021-07-10T07:22:49Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.