Related papers: Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

URL: http://arxiv.org/abs/2505.14741v1
Date: Tue, 20 May 2025 06:58:40 GMT
Title: Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism
Authors: Kunyun Wang, Bohan Li, Kai Yu, Minyi Guo, Jieru Zhao
Abstract summary: Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. We propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. ParaStep achieves end-to-end speedups of up to 3.88$\times$ on SVD, 2.43$\times$ on CogVideoX-2b, and 6.56$\times$ on AudioLDM2-large.
Score: 18.655659400456848
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to 3.88$\times$ on SVD, 2.43$\times$ on CogVideoX-2b, and 6.56$\times$ on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.

Related papers

READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count to speech generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion [16.99620863197586]
Diffusion language models offer parallel token generation and inherent bidirectionality. State-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. We introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking. For the first time, diffusion language models achieve latency comparable to, and even lower than, that of the widely adopted autoregressive models.
arXiv Detail & Related papers (2025-05-27T17:39:39Z)
AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse [19.13826316844611]
Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference. We provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method. We propose a novel caching-based acceleration approach for diffusion models, instead of directly reusing cached results.
arXiv Detail & Related papers (2025-04-13T08:29:58Z)
DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap [38.5017989456818]
DiffGAP is a novel approach incorporating a lightweight generative module within the contrastive space. Our experimental results on VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks.
arXiv Detail & Related papers (2025-03-15T13:24:09Z)
Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling [81.37449968164692]
We propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video. Our approach combines two complementary sampling strategies, which ensure seamless local transitions and enforce global coherence. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence.
arXiv Detail & Related papers (2025-03-11T16:43:45Z)
One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed. It consists of multiple waveguides, which equip numerous low-cost antennas, named pinching antennas (PAs). The positions of PAs can be reconfigured to both spanning large-scale path and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z)
Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference [14.805702987440512]
We introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method substantially boosts intra-node communication speed by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy.
arXiv Detail & Related papers (2024-12-06T11:29:32Z)
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z)
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models [44.384572903945724]
We propose DistriFusion to tackle the problem of generating high-resolution images with diffusion models. Our method splits the model input into multiple patches and assigns each patch to a GPU. Our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one.
arXiv Detail & Related papers (2024-02-29T18:59:58Z)
StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [52.56469577812338]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. We present a novel approach that transforms the original sequential denoising into the denoising process.
arXiv Detail & Related papers (2023-12-19T18:18:33Z)
Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers [3.2634122554914002]
UDPNet is a novel architecture designed to accelerate the reverse diffusion process in speech synthesis. We show that UDPNet consistently outperforms state-of-the-art methods in both quality and efficiency. These results position UDPNet as a robust solution for real-time speech synthesis applications.
arXiv Detail & Related papers (2023-09-18T10:35:27Z)
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.