Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- URL: http://arxiv.org/abs/2312.03209v2
- Date: Fri, 12 Jan 2024 09:26:45 GMT
- Title: Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- Authors: Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou,
Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler,
Christian Rupprecht, Daniel Cremers, Peter Vajda, Jialiang Wang
- Abstract summary: A large image-to-image network has to be applied many times to iteratively refine an image from random noise.
We investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small.
We propose a technique to automatically determine caching schedules based on each block's changes over timesteps.
- Score: 67.54820800003375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have recently revolutionized the field of image synthesis
due to their ability to generate photorealistic images. However, one of the
major drawbacks of diffusion models is that the image generation process is
costly. A large image-to-image network has to be applied many times to
iteratively refine an image from random noise. While many recent works propose
techniques to reduce the number of required steps, they generally treat the
underlying denoising network as a black box. In this work, we investigate the
behavior of the layers within the network and find that 1) the layers' output
changes smoothly over time, 2) the layers show distinct patterns of change, and
3) the change from step to step is often very small. We hypothesize that many
layer computations in the denoising network are redundant. Leveraging this, we
introduce block caching, in which we reuse outputs from layer blocks of
previous steps to speed up inference. Furthermore, we propose a technique to
automatically determine caching schedules based on each block's changes over
timesteps. In our experiments, we show through FID, human evaluation and
qualitative analysis that Block Caching allows to generate images with higher
visual quality at the same computational cost. We demonstrate this for
different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
Related papers
- Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model [31.70050311326183]
Diffusion models tend to generate videos with less motion than expected.
We address this issue from both inference and training aspects.
Our methods outperform baselines by producing higher motion scores with lower errors.
arXiv Detail & Related papers (2024-06-22T04:56:16Z) - Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z) - Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learningto-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-r, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z) - Clockwork Diffusion: Efficient Generation With Model-Step Distillation [42.01130983628078]
Clockwork Diffusion is a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps.
For both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity.
arXiv Detail & Related papers (2023-12-13T13:30:27Z) - DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z) - SDM: Spatial Diffusion Model for Large Hole Image Inpainting [106.90795513361498]
We present a novel spatial diffusion model (SDM) that uses a few iterations to gradually deliver informative pixels to the entire image.
Also, thanks to the proposed decoupled probabilistic modeling and spatial diffusion scheme, our method achieves high-quality large-hole completion.
arXiv Detail & Related papers (2022-12-06T13:30:18Z) - Improving Diffusion Model Efficiency Through Patching [0.0]
We find that adding a simple ViT-style patching transformation can considerably reduce a diffusion model's sampling time and memory usage.
We justify our approach both through an analysis of diffusion model objective, and through empirical experiments on LSUN Church, ImageNet 256, and FFHQ 1024.
arXiv Detail & Related papers (2022-07-09T18:21:32Z) - Dynamic Dual-Output Diffusion Models [100.32273175423146]
Iterative denoising-based generation has been shown to be comparable in quality to other classes of generative models.
A major drawback of this method is that it requires hundreds of iterations to produce a competitive result.
Recent works have proposed solutions that allow for faster generation with fewer iterations, but the image quality gradually deteriorates.
arXiv Detail & Related papers (2022-03-08T11:20:40Z) - Powers of layers for image-to-image translation [60.5529622990682]
We propose a simple architecture to address unpaired image-to-image translation tasks.
We start from an image autoencoder architecture with fixed weights.
For each task we learn a residual block operating in the latent space, which is iteratively called until the target domain is reached.
arXiv Detail & Related papers (2020-08-13T09:02:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.