Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
- URL: http://arxiv.org/abs/2412.02962v1
- Date: Wed, 04 Dec 2024 02:12:50 GMT
- Title: Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
- Authors: XiuYu Zhang, Zening Luo, Michelle E. Lu
- Abstract summary: Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation.
The sequential denoising steps required for generating a single sample could take tens or hundreds of iterations.
We propose Partially Conditioned Patch Parallelism to accelerate the inference of high-resolution diffusion models.
- Abstract: Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, their inference speed is limited by the slow sampling process, which restricts their use cases. The sequential denoising steps required to generate a single sample can take tens or hundreds of iterations and have thus become a significant bottleneck; this limitation is especially salient for interactive or latency-sensitive applications. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Exploiting the fact that images at adjacent diffusion steps differ very little, Patch Parallelism (PP) uses multiple GPUs communicating asynchronously: each device computes one patch of the image, conditioned on the entire image (all patches) from the previous diffusion step. PCPP develops PP further by conditioning each patch only on parts of its neighboring patches at each diffusion step, which reduces inference computation and also decreases communication among devices. As a result, PCPP decreases communication cost by around $70\%$ compared to DistriFusion (the state-of-the-art implementation of PP) and achieves a $2.36\sim 8.02\times$ inference speed-up using $4\sim 8$ GPUs, versus the $2.32\sim 6.71\times$ achieved by DistriFusion, depending on the device configuration and generation resolution, at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.
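To make the partial-conditioning idea concrete, here is a minimal single-process sketch (all names and the toy denoiser are invented; the real method exchanges U-Net activations across GPUs asynchronously, which this toy does not model). Each simulated device updates its own patch conditioned on a narrow halo from its neighbors' previous-step patches, rather than on the full previous-step image as in plain Patch Parallelism:

```python
import numpy as np

def toy_denoiser(x, t):
    """Stand-in for one denoising step (in reality a U-Net or transformer)."""
    return x - 0.1 * t * np.tanh(x)

def pcpp_step(patches, t, halo=2):
    """One PCPP-style step: each 'device' (a loop iteration here) updates its
    patch conditioned only on a narrow halo from its neighbors' previous-step
    patches, instead of on the entire previous-step image as in plain PP."""
    n, h = len(patches), patches[0].shape[0]
    out = []
    for i in range(n):
        left = patches[i - 1][:, -halo:] if i > 0 else np.zeros((h, halo))
        right = patches[i + 1][:, :halo] if i < n - 1 else np.zeros((h, halo))
        context = np.concatenate([left, patches[i], right], axis=1)
        out.append(toy_denoiser(context, t)[:, halo:-halo])  # keep owned region
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 32))
patches = np.split(image, 4, axis=1)      # one patch per simulated device
for step in range(10, 0, -1):
    patches = pcpp_step(patches, step / 10.0)
print(np.concatenate(patches, axis=1).shape)  # (8, 32)
```

Exchanging only the halo columns instead of every patch is what cuts the per-step communication; the halo width controls the trade-off between traffic and image quality.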
Related papers
- One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion method for real-world image super-resolution (Real-ISR) built on flow matching models.
First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR.
Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z)
- Open-Source Acceleration of Stable-Diffusion.cpp Deployable on All Devices [28.774856591172902]
stable-diffusion.cpp (Sdcpp) emerges as an efficient inference framework for accelerating diffusion models.
In this work, we present an optimized version of Sdcpp that leverages the Winograd algorithm to accelerate 2D convolution operations.
We demonstrate a speedup of up to 2.76x for individual convolutional layers and an inference speedup of up to 4.79x for the overall image generation process.
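As context for the Winograd claim, here is a minimal numpy sketch of the classic F(2x2, 3x3) Winograd convolution using the standard Lavin-Gray transforms; it is illustrative only and not taken from the stable-diffusion.cpp code:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transforms (Lavin & Gray, 2015).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_2x2_3x3(d, g):
    """A 2x2 output tile of a valid 3x3 convolution (CNN cross-correlation)
    from a 4x4 input tile, using 16 element-wise multiplies instead of 36."""
    U = G @ g @ G.T        # transformed 3x3 filter -> 4x4
    V = B_T @ d @ B_T.T    # transformed 4x4 input tile
    return A_T @ (U * V) @ A_T.T

rng = np.random.default_rng(0)
d, g = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                for i in range(2)])  # direct convolution for comparison
print(np.allclose(winograd_2x2_3x3(d, g), ref))  # True
```

In a real convolution layer the transform cost amortizes across channels, which is where per-layer speedups of the kind reported above come from.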
arXiv Detail & Related papers (2024-12-08T02:27:17Z)
- AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising [49.785626309848276]
AsyncDiff is a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices.
For Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score.
Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performance.
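A toy, single-process illustration of the asynchronous idea (stage functions and shapes are made up; the real method partitions a noise-prediction network across GPUs):

```python
import numpy as np

# Hypothetical 3-stage split of a denoiser; each stage is an arbitrary toy map
# standing in for a chunk of the noise-prediction network on its own device.
stages = [np.tanh, lambda h: 0.9 * h, lambda h: h - 0.05 * np.tanh(h)]

def async_denoise(x, steps):
    """At every step, stage k consumes the *stale* output stage k-1 produced in
    the previous step (stage 0 reads the previous final output), breaking the
    sequential chain so all stages could run concurrently on separate devices."""
    stale = [x.copy() for _ in stages]   # most recent output of each stage
    for _ in range(steps):
        stale = [stages[0](stale[-1])] + \
                [stages[k](stale[k - 1]) for k in range(1, len(stages))]
    return stale[-1]

x = np.random.default_rng(0).normal(size=(4, 4))
print(async_denoise(x, 20).shape)  # (4, 4)
```

The stale inputs are acceptable for the same reason PCPP's partial conditioning is: hidden states change little between adjacent denoising steps.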
arXiv Detail & Related papers (2024-06-11T03:09:37Z)
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models [44.384572903945724]
We propose DistriFusion to tackle the problem of generating high-resolution images with diffusion models.
Our method splits the model input into multiple patches and assigns each patch to a GPU.
Our method can be applied to the recent Stable Diffusion XL with no quality degradation and achieves up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one.
arXiv Detail & Related papers (2024-02-29T18:59:58Z)
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [29.30999290150683]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation.
Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction.
We present a novel approach that transforms the original sequential denoising into a batching denoising process.
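The re-batching idea can be sketched as follows (a toy stream-batching loop with invented names and update rule: the buffer holds latents at staggered timesteps, one batched call advances them all, and each iteration one finished latent exits while a fresh noisy one enters):

```python
import numpy as np

T = 4  # denoising steps per image in this toy

def denoise_batch(latents, timesteps):
    """One batched call on latents sitting at *different* timesteps (toy update);
    stream batching replaces T sequential model calls with one batched call."""
    return latents * (1.0 - 0.2 * timesteps[:, None, None])

rng = np.random.default_rng(0)
buffer = rng.normal(size=(T, 8, 8))            # one latent per pipeline slot
steps = np.arange(T, 0, -1, dtype=float) / T   # staggered timesteps 1.0 ... 0.25

outputs = []
for _ in range(10):
    buffer = denoise_batch(buffer, steps)
    outputs.append(buffer[-1])           # the most-denoised latent exits
    buffer = np.roll(buffer, 1, axis=0)
    buffer[0] = rng.normal(size=(8, 8))  # a fresh noisy latent enters
# The first T-1 outputs are warm-up in this toy; afterwards one image per call.
print(len(outputs), outputs[0].shape)    # 10 (8, 8)
```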
arXiv Detail & Related papers (2023-12-19T18:18:33Z)
- DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache achieves comparable or even marginally improved results compared with DDIM or PLMS.
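A minimal sketch of the caching pattern (function names and the combine rule are invented; in the paper the cache holds high-level U-Net features reattached through skip connections):

```python
import numpy as np

def shallow(x):                 # cheap layers, recomputed at every step (toy)
    return 0.95 * x

def deep(h):                    # expensive layers, refreshed only occasionally (toy)
    return np.tanh(h)

def deepcache_sample(x, steps, refresh_every=3):
    """Refresh the expensive deep features every `refresh_every` steps and
    otherwise reuse the cache, exploiting the temporal redundancy of
    adjacent denoising steps; the combine rule below is invented."""
    cached = None
    for t in range(steps):
        h = shallow(x)
        if cached is None or t % refresh_every == 0:
            cached = deep(h)    # full forward pass (slow path)
        x = h + cached          # toy stand-in for a U-Net skip connection
    return x

x = np.random.default_rng(0).normal(size=(8, 8))
print(deepcache_sample(x, 12).shape)  # (8, 8)
```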
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
- One-step Diffusion with Distribution Matching Distillation [54.723565605974294]
We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator.
We enforce that the one-step image generator matches the diffusion model at the distribution level by minimizing an approximate KL divergence.
Our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64$\times$64 and 11.49 FID on zero-shot COCO-30k.
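The distribution-level matching can be illustrated in one dimension, where both score functions are analytic (in DMD the real and fake scores are each estimated by a diffusion model and evaluated on diffused samples across noise levels, which this sketch omits):

```python
import numpy as np

# 1D toy: "real" and "fake" distributions are Gaussians with known scores.
mu_real, sigma = 2.0, 1.0
score_real = lambda x: -(x - mu_real) / sigma**2       # grad log p_real(x)
score_fake = lambda x, mu: -(x - mu) / sigma**2        # grad log p_fake(x)

mu_fake, lr, rng = -3.0, 0.5, np.random.default_rng(0)  # generator: x = mu_fake + z
for _ in range(50):
    z = rng.normal(size=256)
    x = mu_fake + z
    grad_x = score_fake(x, mu_fake) - score_real(x)     # approx. KL gradient at x
    mu_fake -= lr * grad_x.mean()                       # chain rule: dx/dmu = 1
print(round(mu_fake, 2))  # ~2.0: the generator's distribution matches the target
```

The update direction s_fake(x) - s_real(x) is the gradient of the approximate KL with respect to the generated sample, pushed into the generator parameters by the chain rule.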
arXiv Detail & Related papers (2023-11-30T18:59:20Z)
- ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
- Continuous Cost Aggregation for Dual-Pixel Disparity Extraction [3.1153758106426603]
We propose a continuous cost aggregation scheme for Dual-Pixel (DP) images.
The proposed algorithm fits parabolas to matching costs and aggregates parabola coefficients along image paths.
Experiments on DP data from both DSLR and phone cameras show that the proposed scheme attains state-of-the-art performance in DP disparity estimation.
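As background for the parabola-fitting step, here is the standard three-point parabola fit that turns discrete matching costs into a continuous disparity (this is the generic sub-pixel refinement; the paper's contribution, aggregating parabola coefficients along image paths, is not reproduced here):

```python
import numpy as np

def subpixel_disparity(costs):
    """Fit a parabola through the minimum cost and its two neighbors; its vertex
    gives a continuous disparity. The coefficients of such parabolas are what a
    continuous aggregation scheme can propagate along image paths."""
    d = int(np.argmin(costs))
    d = min(max(d, 1), len(costs) - 2)   # keep both neighbors in range
    c_m, c_0, c_p = costs[d - 1], costs[d], costs[d + 1]
    denom = c_m - 2 * c_0 + c_p          # twice the quadratic coefficient
    if denom == 0:
        return float(d)                  # degenerate: flat cost, no refinement
    return d - (c_p - c_m) / (2 * denom)

disps = np.arange(10, dtype=float)
costs = (disps - 4.3) ** 2 + 0.1         # toy costs with a true minimum at 4.3
print(round(subpixel_disparity(costs), 3))  # 4.3
```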
arXiv Detail & Related papers (2023-06-13T17:26:50Z)