PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling
- URL: http://arxiv.org/abs/2511.12056v1
- Date: Sat, 15 Nov 2025 06:46:40 GMT
- Title: PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling
- Authors: Sijie Wang, Qiang Wang, Shaohuai Shi
- Abstract summary: Diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is hindered by slow inference speeds and high memory consumption. We propose a novel pipelining framework named PipeDiT to accelerate video generation.
- Score: 18.079843329153412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.
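The decoupling idea behind DeDiVAE can be made concrete with a small producer-consumer sketch. The following is a minimal illustration under stated assumptions, not the paper's implementation: `denoise_step`, `vae_decode`, and the queue-based hand-off are hypothetical stand-ins for the DiT sampling loop running on one GPU group and the VAE decoder running on the other. The point it shows is that decoding of clip i can overlap with denoising of clip i+1, which is where the pipelined latency reduction comes from.

```python
# Hypothetical sketch of diffusion/VAE decoupling across two GPU groups.
# Not PipeDiT's actual code: function names and the threading/queue
# hand-off are illustrative assumptions.
import queue
import threading

import torch


def denoise_step(latent: torch.Tensor) -> torch.Tensor:
    # Stand-in for one DiT denoising step on the diffusion GPU group.
    return latent * 0.99


def vae_decode(latent: torch.Tensor) -> torch.Tensor:
    # Stand-in for VAE decoding on the second GPU group.
    return latent.clamp(-1, 1)


def run_pipeline(clips, diff_dev="cuda:0", vae_dev="cuda:1"):
    # Bounded queue: the diffusion group hands finished latents to the
    # VAE group, which decodes them while the next clip is denoised.
    handoff: queue.Queue = queue.Queue(maxsize=2)
    frames = []

    def vae_worker():
        while True:
            lat = handoff.get()
            if lat is None:  # sentinel: no more clips
                break
            frames.append(vae_decode(lat.to(vae_dev)))

    worker = threading.Thread(target=vae_worker)
    worker.start()
    for noise in clips:
        lat = noise.to(diff_dev)
        for _ in range(4):  # a few illustrative denoising steps
            lat = denoise_step(lat)
        handoff.put(lat)  # decoding overlaps with the next iteration
    handoff.put(None)
    worker.join()
    return frames


if __name__ == "__main__":
    # Pass "cpu" for both groups to run the sketch without GPUs.
    clips = [torch.randn(1, 4, 8, 16, 16) for _ in range(3)]
    print(len(run_pipeline(clips, "cpu", "cpu")))  # -> 3
```

In the paper's setting the two groups are disjoint sets of GPUs and the hand-off is an inter-group transfer; here a thread plus a bounded queue merely mimics the overlap.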
Related papers
- Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling [10.012655130147413]
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation. Our framework achieves 2.31x and 2.07x latency reductions on SDXL and SD3, respectively. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings.
arXiv Detail & Related papers (2026-02-25T10:23:07Z)
- Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference. Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z)
- StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs [8.844450350128362]
Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. StreamFusion is a topology-aware, efficient DiT serving engine. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of 1.35x (up to 1.77x).
arXiv Detail & Related papers (2026-01-28T05:42:07Z)
- PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing [29.552187111796403]
We propose PipeFlow, a scalable, pipelined video editing method. Based on a motion analysis, we identify frames with low motion and propose to skip editing them. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length.
arXiv Detail & Related papers (2025-12-30T06:54:57Z)
- StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
- Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
- Magic 1-For-1: Generating One Minute Video Clips within One Minute [53.07214657235465]
We present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics.
arXiv Detail & Related papers (2025-02-11T16:58:15Z)
- Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile [28.913893318345384]
Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. This paper addresses the inefficiency from two aspects: 1) pruning the 3D full attention based on the redundancy within video data, and 2) shortening the sampling process by adopting existing multi-step consistency distillation.
arXiv Detail & Related papers (2025-02-10T05:00:56Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline-parallelism scheme for accelerating the training of large models.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z)
- Video-Infinity: Distributed Long Video Generation [73.30145218077074]
Diffusion models have recently achieved remarkable results for video generation.
Our method generates videos of up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than prior methods.
arXiv Detail & Related papers (2024-06-24T01:56:12Z)
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation [52.56469577812338]
We introduce StreamDiffusion, a real-time diffusion pipeline for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. We present a novel approach that transforms the original sequential denoising into a batching denoising process.
arXiv Detail & Related papers (2023-12-19T18:18:33Z)
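The "batching denoising" idea summarized in the StreamDiffusion entry above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' code: `denoise` is a hypothetical stand-in for the model, and the rolling batch holds up to T inputs at staggered timesteps, so one forward pass advances every in-flight sample by one step instead of pushing a single input through all T steps before accepting the next.

```python
# Hypothetical sketch of batched ("stream batch") denoising.
# Names and shapes are illustrative assumptions, not StreamDiffusion's API.
import collections

import torch

T = 4  # denoising steps per input


def denoise(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Stand-in for one model call conditioned on per-sample timesteps t.
    return x * (1.0 - 1.0 / T)


def stream_denoise(inputs):
    """Yield finished samples; up to T inputs are in flight at once."""
    pending = collections.deque(inputs)
    batch, steps = [], []  # in-flight latents and their step counters
    while pending or batch:
        if pending and len(batch) < T:
            # Admit a new input each iteration instead of waiting for
            # the current one to finish all T steps.
            batch.append(pending.popleft())
            steps.append(0)
        x = denoise(torch.stack(batch), torch.tensor(steps))
        batch = list(x.unbind(0))
        steps = [s + 1 for s in steps]
        if steps[0] == T:  # the oldest sample finishes first
            yield batch.pop(0)
            steps.pop(0)


if __name__ == "__main__":
    frames = [torch.randn(3, 8, 8) for _ in range(6)]
    print(sum(1 for _ in stream_denoise(frames)))  # -> 6
```

Because every iteration issues one batched model call, GPU utilization stays high while per-input latency is amortized across the stream, which is the trade-off the entry's "real-time interactive generation" claim rests on.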