db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
- URL: http://arxiv.org/abs/2511.23113v1
- Date: Fri, 28 Nov 2025 11:55:46 GMT
- Title: db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
- Authors: Siqi Chen, Ke Hong, Tianchen Zhao, Ruiqi Xie, Zhenhua Zhu, Xudong Zhang, Yu Wang
- Abstract summary: Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation. We formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique. We show that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods.
- Score: 14.406306253079515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at https://github.com/thu-nics/db-SP.
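To make the head-level balancing idea concrete, below is a minimal, hypothetical sketch (not the authors' released implementation; see the GitHub repository linked above for that). It assumes a per-head block-wise mask, measures imbalance as the most-loaded device's dense-block count divided by the mean device load, and balances heads across devices with a greedy longest-processing-time heuristic. The function names, mask shape, and greedy heuristic are illustrative assumptions rather than db-SP's exact dual-level algorithm.

```python
# Hypothetical sketch of sparsity-aware head partitioning for sequence parallelism.
# Assumptions (not from the paper): mask shape (heads, blocks, blocks), imbalance
# ratio defined as max device load / mean device load, greedy LPT assignment.
import numpy as np

def imbalance_ratio(loads):
    """Ratio of the most-loaded device to the average load (1.0 = perfectly balanced)."""
    loads = np.asarray(loads, dtype=float)
    return loads.max() / loads.mean()

def naive_head_split(block_mask, num_devices):
    """Ulysses-style split: contiguous, equal-sized head groups, ignoring sparsity."""
    num_heads = block_mask.shape[0]
    groups = np.array_split(np.arange(num_heads), num_devices)
    return [block_mask[g].sum() for g in groups]

def sparsity_aware_head_split(block_mask, num_devices):
    """Greedy longest-processing-time assignment: place each head on the currently
    least-loaded device, using its dense-block count as the cost."""
    cost = block_mask.sum(axis=(1, 2))        # dense blocks per head
    order = np.argsort(cost)[::-1]            # heaviest heads first
    loads = np.zeros(num_devices)
    assignment = [[] for _ in range(num_devices)]
    for h in order:
        d = int(loads.argmin())
        assignment[d].append(int(h))
        loads[d] += cost[h]
    return loads, assignment

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heads, blocks, devices = 24, 64, 4
    # Random block-wise mask whose per-head density varies from ~10% to ~90%.
    density = rng.uniform(0.1, 0.9, size=heads)
    mask = rng.random((heads, blocks, blocks)) < density[:, None, None]
    print("naive split imbalance:    %.2f" % imbalance_ratio(naive_head_split(mask, devices)))
    balanced_loads, _ = sparsity_aware_head_split(mask, devices)
    print("balanced split imbalance: %.2f" % imbalance_ratio(balanced_loads))
```

Under these assumptions, the sparsity-aware split drives the imbalance ratio toward 1.0, whereas the sparsity-agnostic split inherits whatever head-to-head variation the mask happens to have; the paper's block-level partitioning and runtime choice of parallel degrees are not modeled here.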
Related papers
- Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling [10.012655130147413]
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation. Our framework achieves $2.31\times$ and $2.07\times$ latency reductions on SDXL and SD3, respectively. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings.
arXiv Detail & Related papers (2026-02-25T10:23:07Z) - Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers [36.650880799066215]
Asynchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. Our approach achieves a 1.57x speedup in end-to-end time and reduces step latency by 5.8x compared to the baseline.
arXiv Detail & Related papers (2026-02-04T07:38:24Z) - Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing [76.48164395646019]
Parallel-Probe is a training-free controller designed to optimize online parallel thinking. It reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
arXiv Detail & Related papers (2026-02-03T18:59:41Z) - ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models [99.6720868215076]
We introduce ThreadWeaver, a framework for adaptive parallel reasoning. ThreadWeaver achieves accuracy on par with popular sequential reasoning models of comparable size. We show that ThreadWeaver delivers up to 1.53x average speedup in token latency.
arXiv Detail & Related papers (2025-11-24T18:55:59Z) - Higher-order Linear Attention [59.92962330635185]
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics.
arXiv Detail & Related papers (2025-10-31T07:54:37Z) - Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models [54.81955614221652]
Parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. We propose several practical mitigations, including parallel-oriented prompting, diffusion early stopping, and parallel scaling, to reduce PSC-induced ineffectiveness and inefficiency.
arXiv Detail & Related papers (2025-10-10T16:58:14Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - ATTS: Asynchronous Test-Time Scaling via Conformal Prediction [112.54016379556073]
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework. We show that ATTS delivers up to a 56.7x speedup in test-time scaling and a 4.14x throughput improvement.
arXiv Detail & Related papers (2025-09-18T16:55:09Z) - Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks [5.877451898618022]
FlowHN is a novel parallel hybrid network architecture that accommodates various strategies for load balancing. Two innovative differentiating factors in FlowHN include a FLOP-aware dynamic token split between the attention and SSM branches.
arXiv Detail & Related papers (2025-05-26T03:52:22Z) - Two-dimensional Parallel Tempering for Constrained Optimization [0.3068068202044424]
We introduce a two-dimensional extension of the powerful parallel tempering algorithm (PT). The resulting two-dimensional parallel tempering algorithm (2D-PT) improves mixing in heavily constrained replicas. The method applies broadly to constrained Ising problems and can be deployed on existing Ising machines.
arXiv Detail & Related papers (2025-05-24T20:41:45Z) - Dynamic Dual Trainable Bounds for Ultra-low Precision Super-Resolution Networks [82.18396309806577]
We propose a novel activation quantizer, referred to as Dynamic Dual Trainable Bounds (DDTB).
Our DDTB exhibits significant performance improvements in ultra-low precision.
For example, our DDTB achieves a 0.70dB PSNR increase on Urban100 benchmark when quantizing EDSR to 2-bit and scaling up output images to x4.
arXiv Detail & Related papers (2022-03-08T04:26:18Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)