Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion
Models
- URL: http://arxiv.org/abs/2312.09608v1
- Date: Fri, 15 Dec 2023 08:46:43 GMT
- Title: Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion
Models
- Authors: Senmao Li, Taihang Hu, Fahad Shahbaz Khan, Linxuan Li, Shiqi Yang,
Yaxing Wang, Ming-Ming Cheng and Jian Yang
- Abstract summary: We conduct the first comprehensive study of the UNet encoder.
We find that encoder features change gently, whereas the decoder features exhibit substantial variations across different time-steps.
Benefiting from our propagation scheme, we are able to run the decoder in parallel at certain adjacent time-steps.
- Score: 95.47438940934413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the key components within diffusion models is the UNet for noise
prediction. While several works have explored basic properties of the UNet
decoder, its encoder largely remains unexplored. In this work, we conduct the
first comprehensive study of the UNet encoder. We empirically analyze the
encoder features and provide insights into important questions regarding how
they change during the inference process. In particular, we find that encoder
features
change gently, whereas the decoder features exhibit substantial variations
across different time-steps. This finding inspired us to omit the encoder at
certain adjacent time-steps and cyclically reuse the encoder features from
previous time-steps for the decoder. Building on this observation, we
introduce a simple yet effective encoder propagation scheme to accelerate the
diffusion sampling for a diverse set of tasks. Benefiting from our propagation
scheme, we are able to run the decoder in parallel at certain adjacent
time-steps. Additionally, we introduce a prior noise injection method
to improve the texture details in the generated image. Besides the standard
text-to-image task, we also validate our approach on other tasks:
text-to-video, personalized generation and reference-guided generation. Without
utilizing any knowledge distillation technique, our approach accelerates both
the Stable Diffusion (SD) and the DeepFloyd-IF models sampling by 41$\%$ and
24$\%$ respectively, while maintaining high-quality generation performance. Our
code is available at
\href{https://github.com/hutaiHang/Faster-Diffusion}{FasterDiffusion}.
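As a rough illustration of the feature-reuse part of this scheme, here is a minimal sketch (not the authors' released implementation): the toy modules, the every-5th-step key-step schedule, and the sampler update are all illustrative assumptions.

```python
# Minimal sketch of encoder propagation: at "key" time-steps the full UNet
# runs; at the remaining time-steps the encoder is skipped and its cached
# features are reused, so only the decoder is evaluated.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Conv2d(4, ch, 3, padding=1)
    def forward(self, x, t):
        return self.net(x)  # a real encoder also consumes the time embedding t

class ToyDecoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Conv2d(ch, 4, 3, padding=1)
    def forward(self, feats, t):
        return self.net(feats)  # stand-in for noise prediction (skips omitted)

encoder, decoder = ToyEncoder(), ToyDecoder()
x = torch.randn(1, 4, 32, 32)        # latent being denoised
timesteps = list(range(50, 0, -1))
key_steps = set(timesteps[::5])      # assumed schedule: encoder every 5th step

cached_feats = None
for t in timesteps:
    if t in key_steps or cached_feats is None:
        cached_feats = encoder(x, t)          # full forward at a key step
    noise_pred = decoder(cached_feats, t)     # reuse cached encoder features
    x = x - 0.01 * noise_pred                 # stand-in for a real sampler update
```

The saving comes from skipping the encoder at non-key steps; the paper additionally runs the decoder in parallel across those adjacent steps and injects prior noise, both of which this sketch omits.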
Related papers
- $ε$-VAE: Denoising as Visual Decoding [61.29255979767292] (arXiv 2024-10-05)
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation quality.
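A hedged sketch of the denoising-as-decoding idea: the deterministic decoder is replaced by an iterative refiner conditioned on the encoder's latent. The Refiner module, the 8-step loop, and all shapes are illustrative assumptions, not the paper's architecture.

```python
# Denoising as decoding: turn noise into the reconstruction over several
# steps, guided by the latent z from the (frozen) encoder.
import torch
import torch.nn as nn

latent_dim, img_ch = 16, 3

class Refiner(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(img_ch + latent_dim, img_ch, 3, padding=1)
    def forward(self, x_noisy, z):
        return self.net(torch.cat([x_noisy, z], dim=1))  # predicts a cleaner image

refiner = Refiner()
z = torch.randn(1, latent_dim, 32, 32)   # latent from the encoder (assumed shape)
x = torch.randn(1, img_ch, 32, 32)       # start from pure noise
for step in range(8):                    # iterative refinement replaces one-shot decoding
    x = refiner(x, z)
```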
- Correcting Diffusion-Based Perceptual Image Compression with Privileged End-to-End Decoder [49.01721042973929] (arXiv 2024-04-07)
This paper presents a diffusion-based image compression method that employs a privileged end-to-end decoder model for correction.
Experiments demonstrate the superiority of our method in both distortion and perception compared with previous perceptual compression methods.
- Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder [29.924160271522354] (arXiv 2024-03-15)
Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications.
Most existing methods, however, generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts.
The most relevant prior work applies Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results.
We propose a novel pipeline that can super-resolve an input image or generate a novel image from random noise at arbitrary scales.
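A minimal sketch of the implicit-neural-decoder idea: one MLP queried at continuous coordinates can render any output resolution. The MLP, the shared feature vector, and the grid construction are illustrative assumptions.

```python
# An INR-style decoder maps (feature, coordinate) pairs to RGB, so the same
# model can render at arbitrary target resolutions.
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(64 + 2, 128), nn.ReLU(), nn.Linear(128, 3))

def render(feat_vec, height, width):
    # Build a normalized coordinate grid at the requested (arbitrary) scale.
    ys = torch.linspace(-1, 1, height)
    xs = torch.linspace(-1, 1, width)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H, W, 2)
    feats = feat_vec.expand(height, width, -1)         # shared feature per pixel (toy)
    return mlp(torch.cat([feats, grid], dim=-1))       # (H, W, 3) RGB values

feat = torch.randn(64)
small = render(feat, 32, 32)      # same decoder, two different output scales
large = render(feat, 128, 128)
```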
- Clockwork Diffusion: Efficient Generation With Model-Step Distillation [42.01130983628078] (arXiv 2023-12-13)
Clockwork Diffusion is a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps.
For both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity.
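A toy sketch of the periodic-reuse idea: the costly low-resolution branch is recomputed only on "clock ticks" and its features are reused in between, while the cheap high-resolution path runs every step. The branch modules, period, and update rule are assumptions, not the paper's architecture.

```python
# Clockwork-style reuse: recompute the expensive low-res path every `period`
# denoising steps; approximate it with the cached features otherwise.
import torch
import torch.nn as nn

lowres_branch = nn.Conv2d(8, 8, 3, padding=1)    # stand-in for the deep, costly path
highres_branch = nn.Conv2d(16, 8, 3, padding=1)  # stand-in for the shallow path

x = torch.randn(1, 8, 64, 64)
period, low_feat = 4, None
for step, t in enumerate(range(20, 0, -1)):
    if step % period == 0:
        low = nn.functional.avg_pool2d(x, 2)      # downsample for the low-res path
        low_feat = lowres_branch(low)             # recompute on the clock tick
    up = nn.functional.interpolate(low_feat, scale_factor=2)  # reuse in between
    noise_pred = highres_branch(torch.cat([x, up], dim=1))
    x = x - 0.01 * noise_pred                     # stand-in sampler update
```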
- DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models [22.276574156358084] (arXiv 2023-11-15)
We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
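A toy sketch of dynamic early exit on a decoder, assuming per-layer prediction heads trained with deep supervision; the layer count, head design, and 0.9 confidence threshold are illustrative, not the paper's settings.

```python
# Every decoder layer has its own exit head; inference stops as soon as one
# layer's prediction is confident enough, skipping the remaining layers.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])
heads = nn.ModuleList([nn.Linear(32, 10) for _ in range(6)])  # one exit head per layer

def decode_with_early_exit(h, threshold=0.9):
    for layer, head in zip(layers, heads):
        h = torch.relu(layer(h))
        probs = head(h).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:   # confident enough: exit early
            return pred, conf
    return pred, conf                  # fall through: use the last layer's output

token, confidence = decode_with_early_exit(torch.randn(1, 32))
```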
- Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner [84.97253871387028] (arXiv 2023-10-14)
A diffusion model, which is formulated to produce an image through thousands of denoising steps, usually suffers from slow inference.
We propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost.
Experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods.
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114] (arXiv 2023-07-17)
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
- Lossy Image Compression with Conditional Diffusion Models [25.158390422252097] (arXiv 2022-09-14)
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
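A hedged sketch of decoding via a conditional diffusion model instead of a deterministic decoder; the toy transforms, the 10-step loop, and the omission of entropy coding are simplifying assumptions.

```python
# Compression with a conditional-diffusion decoder: the latent is quantized
# (entropy coding omitted here), and reconstruction samples from a denoiser
# conditioned on that latent rather than running a one-shot decoder.
import torch
import torch.nn as nn

enc = nn.Conv2d(3, 8, 4, stride=4)            # toy analysis transform
denoiser = nn.Conv2d(3 + 8, 3, 3, padding=1)  # toy conditional noise predictor

img = torch.rand(1, 3, 64, 64)
y = torch.round(enc(img))                     # quantized latent to be transmitted

x = torch.randn(1, 3, 64, 64)                 # reconstruction starts from noise
cond = nn.functional.interpolate(y, scale_factor=4)
for t in range(10):                           # a few denoising steps conditioned on y
    noise_pred = denoiser(torch.cat([x, cond], dim=1))
    x = x - 0.1 * noise_pred
```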
- Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155] (arXiv 2022-03-09)
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
This list is automatically generated from the titles and abstracts of the papers on this site.