Related papers: Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

URL: http://arxiv.org/abs/2602.21760v1
Date: Wed, 25 Feb 2026 10:23:07 GMT
Title: Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
Authors: Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee
Abstract summary: Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation. Our framework achieves $2.31\times$ and $2.07\times$ latency reductions on SDXL and SD3, respectively. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings.
Score: 10.012655130147413
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves $2.31\times$ and $2.07\times$ latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.
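
Illustration (not from the paper): the first key idea in the abstract, treating the conditional and unconditional denoising paths as a data partition, can be sketched in a few lines. The sketch below assumes the two paths are the two branches of standard classifier-free guidance, which the abstract does not state explicitly. `toy_denoiser`, `cfg_step_partitioned`, and `cfg_step_sequential` are hypothetical names, the "network" is a toy arithmetic function, and two threads stand in for two GPUs. It is not the Hybridiff implementation, and the L2 gap it reports is only one possible way to measure the "denoising discrepancy" the abstract mentions.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(x, t, cond):
    """Hypothetical stand-in for a noise-prediction network (not a real model)."""
    shift = 0.0 if cond is None else float(cond)
    return [0.1 * v + shift * t for v in x]


def cfg_step_partitioned(x, t, cond, guidance_scale, pool):
    """One guided step with the two paths submitted as independent tasks."""
    # Assumed reading of condition-based partitioning: the conditional and
    # unconditional evaluations do not depend on each other, so each one
    # could run on its own device. Threads play that role here.
    fut_cond = pool.submit(toy_denoiser, x, t, cond)
    fut_uncond = pool.submit(toy_denoiser, x, t, None)
    eps_c, eps_u = fut_cond.result(), fut_uncond.result()

    # L2 gap between the two predictions (an assumed discrepancy measure).
    gap = math.sqrt(sum((c - u) ** 2 for c, u in zip(eps_c, eps_u)))

    # Standard classifier-free guidance combination of the two predictions.
    eps = [u + guidance_scale * (c - u) for c, u in zip(eps_c, eps_u)]
    return eps, gap


def cfg_step_sequential(x, t, cond, guidance_scale):
    """Reference: the same step with both paths evaluated one after the other."""
    eps_c = toy_denoiser(x, t, cond)
    eps_u = toy_denoiser(x, t, None)
    return [u + guidance_scale * (c - u) for c, u in zip(eps_c, eps_u)]


x = [1.0] * 8
with ThreadPoolExecutor(max_workers=2) as pool:
    eps_par, gap = cfg_step_partitioned(x, 0.5, 1.0, 7.5, pool)
eps_seq = cfg_step_sequential(x, 0.5, 1.0, 7.5)
```

Because the two tasks share no state, the partitioned step returns the same guided prediction as the sequential one; only where the work runs changes. How the authors switch to pipeline parallelism based on the discrepancy is not described in the abstract and is not modeled here.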

Related papers

AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism [54.8494905524997]
We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
arXiv Detail & Related papers (2026-01-30T01:24:47Z)
StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
STADI: Fine-Grained Step-Patch Diffusion Parallelism for Heterogeneous GPUs [14.137795556562686]
This paper introduces Spatio-Temporal Adaptive Diffusion Inference (STADI), a novel framework to accelerate diffusion model inference. At its core is a hybrid scheduler that orchestrates fine-grained parallelism across both temporal and spatial dimensions. Our method significantly reduces end-to-end inference latency by up to 45% and significantly improves resource utilization on heterogeneous GPUs.
arXiv Detail & Related papers (2025-09-05T00:25:40Z)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism [26.365397387678396]
Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. We propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. ParaStep achieves end-to-end speedups of up to 3.88$\times$ on SVD, 2.43$\times$ on CogVideoX-2b, and 6.56$\times$
arXiv Detail & Related papers (2025-05-20T06:58:40Z)
Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in Pipeline Parallelism. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients.
arXiv Detail & Related papers (2025-05-02T08:23:29Z)
BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline parallelism for accelerating large models training. We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z)
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising [49.785626309848276]
AsyncDiff is a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. For the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances.
arXiv Detail & Related papers (2024-06-11T03:09:37Z)
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models [44.384572903945724]
We propose DistriFusion to tackle the problem of generating high-resolution images with diffusion models. Our method splits the model input into multiple patches and assigns each patch to a GPU. Our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one.
arXiv Detail & Related papers (2024-02-29T18:59:58Z)
StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [52.56469577812338]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. We present a novel approach that transforms the original sequential denoising into the batching denoising process.
arXiv Detail & Related papers (2023-12-19T18:18:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.