Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
- URL: http://arxiv.org/abs/2412.02962v1
- Date: Wed, 04 Dec 2024 02:12:50 GMT
- Title: Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
- Authors: XiuYu Zhang, Zening Luo, Michelle E. Lu
- Abstract summary: Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation.
The sequential denoising steps required for generating a single sample could take tens or hundreds of iterations.
We propose Partially Conditioned Patch Parallelism to accelerate the inference of high-resolution diffusion models.
- Abstract: Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, their inference speed is limited by the slow sampling process, which restricts their use cases. The sequential denoising steps required to generate a single sample can take tens or hundreds of iterations and have thus become a significant bottleneck; this limitation is especially salient for interactive or latency-sensitive applications. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Exploiting the fact that images at adjacent diffusion steps differ very little, Patch Parallelism (PP) uses multiple GPUs communicating asynchronously: each device computes one patch of the image, conditioned on the entire image (all patches) from the previous diffusion step. PCPP develops PP further by conditioning each patch only on parts of its neighboring patches at each diffusion step, which reduces inference computation and also decreases communication among devices. As a result, PCPP decreases communication cost by around $70\%$ compared to DistriFusion (the state-of-the-art implementation of PP) and achieves a $2.36\sim 8.02\times$ inference speed-up using $4\sim 8$ GPUs, versus the $2.32\sim 6.71\times$ achieved by DistriFusion, depending on the device configuration and generation resolution, at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.
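To make the partial-conditioning idea concrete, here is a minimal single-process sketch (all names and the toy denoiser are invented; the real method exchanges U-Net activations across GPUs asynchronously, which this toy does not model). Each simulated device updates its own patch conditioned on a narrow halo from its neighbors' previous-step patches, rather than on the full previous-step image as in plain Patch Parallelism:

```python
import numpy as np

def toy_denoiser(x, t):
    """Stand-in for one denoising step (in reality a U-Net or transformer)."""
    return x - 0.1 * t * np.tanh(x)

def pcpp_step(patches, t, halo=2):
    """One PCPP-style step: each 'device' (a loop iteration here) updates its
    patch conditioned only on a narrow halo from its neighbors' previous-step
    patches, instead of on the entire previous-step image as in plain PP."""
    n, h = len(patches), patches[0].shape[0]
    out = []
    for i in range(n):
        left = patches[i - 1][:, -halo:] if i > 0 else np.zeros((h, halo))
        right = patches[i + 1][:, :halo] if i < n - 1 else np.zeros((h, halo))
        context = np.concatenate([left, patches[i], right], axis=1)
        out.append(toy_denoiser(context, t)[:, halo:-halo])  # keep owned region
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 32))
patches = np.split(image, 4, axis=1)      # one patch per simulated device
for step in range(10, 0, -1):
    patches = pcpp_step(patches, step / 10.0)
print(np.concatenate(patches, axis=1).shape)  # (8, 32)
```

Exchanging only the halo columns instead of every patch is what cuts the per-step communication; the halo width controls the trade-off between traffic and image quality.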
Related papers
- One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion method for real-world image super-resolution (Real-ISR) built on flow matching models.
First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR.
Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z)
- Open-Source Acceleration of Stable-Diffusion.cpp Deployable on All Devices [28.774856591172902]
stable-diffusion.cpp (Sdcpp) emerges as an efficient inference framework for accelerating diffusion models.
In this work, we present an optimized version of Sdcpp that leverages the Winograd algorithm to accelerate 2D convolution operations.
We demonstrate a speedup of up to 2.76x for individual convolutional layers and an inference speedup of up to 4.79x for the overall image generation process.
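As context for the Winograd claim, here is a minimal numpy sketch of the classic F(2x2, 3x3) Winograd convolution using the standard Lavin-Gray transforms; it is illustrative only and not taken from the stable-diffusion.cpp code:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transforms (Lavin & Gray, 2015).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_2x2_3x3(d, g):
    """A 2x2 output tile of a valid 3x3 convolution (CNN cross-correlation)
    from a 4x4 input tile, using 16 element-wise multiplies instead of 36."""
    U = G @ g @ G.T        # transformed 3x3 filter -> 4x4
    V = B_T @ d @ B_T.T    # transformed 4x4 input tile
    return A_T @ (U * V) @ A_T.T

rng = np.random.default_rng(0)
d, g = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                for i in range(2)])  # direct convolution for comparison
print(np.allclose(winograd_2x2_3x3(d, g), ref))  # True
```

In a real convolution layer the transform cost amortizes across channels, which is where per-layer speedups of the kind reported above come from.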
arXiv Detail & Related papers (2024-12-08T02:27:17Z)
- AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising [49.785626309848276]
AsyncDiff is a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices.
For Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score.
Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performance.
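A toy, single-process illustration of the asynchronous idea (stage functions and shapes are made up; the real method partitions a noise-prediction network across GPUs):

```python
import numpy as np

# Hypothetical 3-stage split of a denoiser; each stage is an arbitrary toy map
# standing in for a chunk of the noise-prediction network on its own device.
stages = [np.tanh, lambda h: 0.9 * h, lambda h: h - 0.05 * np.tanh(h)]

def async_denoise(x, steps):
    """At every step, stage k consumes the *stale* output stage k-1 produced in
    the previous step (stage 0 reads the previous final output), breaking the
    sequential chain so all stages could run concurrently on separate devices."""
    stale = [x.copy() for _ in stages]   # most recent output of each stage
    for _ in range(steps):
        stale = [stages[0](stale[-1])] + \
                [stages[k](stale[k - 1]) for k in range(1, len(stages))]
    return stale[-1]

x = np.random.default_rng(0).normal(size=(4, 4))
print(async_denoise(x, 20).shape)  # (4, 4)
```

The stale inputs are acceptable for the same reason PCPP's partial conditioning is: hidden states change little between adjacent denoising steps.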
arXiv Detail & Related papers (2024-06-11T03:09:37Z)
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models [44.384572903945724]
We propose DistriFusion to tackle the problem of generating high-resolution images with diffusion models.
Our method splits the model input into multiple patches and assigns each patch to a GPU.
Our method can be applied to the recent Stable Diffusion XL with no quality degradation and achieves up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one.
arXiv Detail & Related papers (2024-02-29T18:59:58Z)
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [29.30999290150683]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation.
Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction.
We present a novel approach that transforms the original sequential denoising into a batching denoising process.
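The re-batching idea can be sketched as follows (a toy stream-batching loop with invented names and update rule: the buffer holds latents at staggered timesteps, one batched call advances them all, and each iteration one finished latent exits while a fresh noisy one enters):

```python
import numpy as np

T = 4  # denoising steps per image in this toy

def denoise_batch(latents, timesteps):
    """One batched call on latents sitting at *different* timesteps (toy update);
    stream batching replaces T sequential model calls with one batched call."""
    return latents * (1.0 - 0.2 * timesteps[:, None, None])

rng = np.random.default_rng(0)
buffer = rng.normal(size=(T, 8, 8))            # one latent per pipeline slot
steps = np.arange(T, 0, -1, dtype=float) / T   # staggered timesteps 1.0 ... 0.25

outputs = []
for _ in range(10):
    buffer = denoise_batch(buffer, steps)
    outputs.append(buffer[-1])           # the most-denoised latent exits
    buffer = np.roll(buffer, 1, axis=0)
    buffer[0] = rng.normal(size=(8, 8))  # a fresh noisy latent enters
# The first T-1 outputs are warm-up in this toy; afterwards one image per call.
print(len(outputs), outputs[0].shape)    # 10 (8, 8)
```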
arXiv Detail & Related papers (2023-12-19T18:18:33Z)
- DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache achieves comparable or even marginally improved results compared with DDIM or PLMS.
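A minimal sketch of the caching pattern (function names and the combine rule are invented; in the paper the cache holds high-level U-Net features reattached through skip connections):

```python
import numpy as np

def shallow(x):                 # cheap layers, recomputed at every step (toy)
    return 0.95 * x

def deep(h):                    # expensive layers, refreshed only occasionally (toy)
    return np.tanh(h)

def deepcache_sample(x, steps, refresh_every=3):
    """Refresh the expensive deep features every `refresh_every` steps and
    otherwise reuse the cache, exploiting the temporal redundancy of
    adjacent denoising steps; the combine rule below is invented."""
    cached = None
    for t in range(steps):
        h = shallow(x)
        if cached is None or t % refresh_every == 0:
            cached = deep(h)    # full forward pass (slow path)
        x = h + cached          # toy stand-in for a U-Net skip connection
    return x

x = np.random.default_rng(0).normal(size=(8, 8))
print(deepcache_sample(x, 12).shape)  # (8, 8)
```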
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
- One-step Diffusion with Distribution Matching Distillation [54.723565605974294]
We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator.
We enforce that the one-step image generator matches the diffusion model at the distribution level by minimizing an approximate KL divergence.
Our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64$\times$64 and 11.49 FID on zero-shot COCO-30k.
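The distribution-level matching can be illustrated in one dimension, where both score functions are analytic (in DMD the real and fake scores are each estimated by a diffusion model and evaluated on diffused samples across noise levels, which this sketch omits):

```python
import numpy as np

# 1D toy: "real" and "fake" distributions are Gaussians with known scores.
mu_real, sigma = 2.0, 1.0
score_real = lambda x: -(x - mu_real) / sigma**2       # grad log p_real(x)
score_fake = lambda x, mu: -(x - mu) / sigma**2        # grad log p_fake(x)

mu_fake, lr, rng = -3.0, 0.5, np.random.default_rng(0)  # generator: x = mu_fake + z
for _ in range(50):
    z = rng.normal(size=256)
    x = mu_fake + z
    grad_x = score_fake(x, mu_fake) - score_real(x)     # approx. KL gradient at x
    mu_fake -= lr * grad_x.mean()                       # chain rule: dx/dmu = 1
print(round(mu_fake, 2))  # ~2.0: the generator's distribution matches the target
```

The update direction s_fake(x) - s_real(x) is the gradient of the approximate KL with respect to the generated sample, pushed into the generator parameters by the chain rule.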
arXiv Detail & Related papers (2023-11-30T18:59:20Z)
- ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
- Continuous Cost Aggregation for Dual-Pixel Disparity Extraction [3.1153758106426603]
We propose a continuous cost aggregation scheme for Dual-Pixel (DP) images.
The proposed algorithm fits parabolas to matching costs and aggregates parabola coefficients along image paths.
Experiments on DP data from both DSLR and phone cameras show that the proposed scheme attains state-of-the-art performance in DP disparity estimation.
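As background for the parabola-fitting step, here is the standard three-point parabola fit that turns discrete matching costs into a continuous disparity (this is the generic sub-pixel refinement; the paper's contribution, aggregating parabola coefficients along image paths, is not reproduced here):

```python
import numpy as np

def subpixel_disparity(costs):
    """Fit a parabola through the minimum cost and its two neighbors; its vertex
    gives a continuous disparity. The coefficients of such parabolas are what a
    continuous aggregation scheme can propagate along image paths."""
    d = int(np.argmin(costs))
    d = min(max(d, 1), len(costs) - 2)   # keep both neighbors in range
    c_m, c_0, c_p = costs[d - 1], costs[d], costs[d + 1]
    denom = c_m - 2 * c_0 + c_p          # twice the quadratic coefficient
    if denom == 0:
        return float(d)                  # degenerate: flat cost, no refinement
    return d - (c_p - c_m) / (2 * denom)

disps = np.arange(10, dtype=float)
costs = (disps - 4.3) ** 2 + 0.1         # toy costs with a true minimum at 4.3
print(round(subpixel_disparity(costs), 3))  # 4.3
```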
arXiv Detail & Related papers (2023-06-13T17:26:50Z)