You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts
- URL: http://arxiv.org/abs/2505.07477v1
- Date: Mon, 12 May 2025 12:09:11 GMT
- Title: You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts
- Authors: Hongkun Dou, Zeyu Li, Xingyu Jiang, Hongjue Li, Lijun Yang, Wen Yao, Yue Deng
- Abstract summary: Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. Many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. We propose a more efficient alternative that approaches the problem from the perspective of parallel denoising.
- Score: 13.191937642688279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by $\sim 90\%$ while maintaining superior performance. Code is available at https://github.com/deng-ai-lab/SDO.
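The one-step shortcut can be illustrated with a toy scalar sampler. The denoiser, step count, and loss below are hypothetical stand-ins, not the paper's actual models: every step except the last is treated as a constant, so the gradient of the downstream metric is chained through only the final step's Jacobian. This is a minimal sketch of the idea, not the SDO implementation.

```python
import numpy as np

def denoise_step(x, t):
    # Hypothetical scalar "denoiser"; a stand-in for one network call.
    return 0.9 * x + 0.1 * np.sin(t)

def sample_with_shortcut(x_T, T, target):
    # Run all but the final step as if detached (no gradient tracked).
    x = x_T
    for t in range(T, 1, -1):
        x = denoise_step(x, t)        # treated as a constant
    # Only the last step's computational graph is retained.
    x0 = denoise_step(x, 1)
    loss = 0.5 * (x0 - target) ** 2   # downstream differentiable metric
    # Shortcut gradient: chain rule through ONE step only.
    dloss_dx0 = x0 - target
    dx0_dx = 0.9                      # Jacobian of the single retained step
    grad = dloss_dx0 * dx0_dx
    return x0, loss, grad
```

With full backpropagation, the gradient would chain all T step Jacobians and keep all T activations in memory; retaining one step's graph is what yields the large memory and time savings the abstract reports.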
Related papers
- Low-rank Momentum Factorization for Memory Efficient Training [13.464518325870444]
Momentum Factorized SGD (MoFaSGD) maintains a dynamically updated low-rank SVD representation of the first-order momentum. We demonstrate MoFaSGD's effectiveness on large language model benchmarks, achieving a competitive trade-off between memory reduction (e.g., LoRA) and performance.
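The low-rank momentum idea can be sketched densely with NumPy. The function and its names below are hypothetical, and a real implementation would avoid materializing the full momentum matrix; this sketch just blends the new gradient into the reconstructed momentum and re-truncates it with an SVD:

```python
import numpy as np

def lowrank_momentum_update(U, S, V, grad, beta=0.9, rank=4):
    # Reconstruct the momentum from its low-rank factors (dense, for clarity),
    # mix in the new gradient, then re-truncate via SVD.
    m = beta * (U * S) @ V.T + (1 - beta) * grad
    U2, S2, V2t = np.linalg.svd(m, full_matrices=False)
    return U2[:, :rank], S2[:rank], V2t[:rank].T
```

Only the factors (U, S, V) are stored between steps, so memory scales with the rank rather than the full parameter matrix.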
arXiv Detail & Related papers (2025-07-10T18:04:52Z) - Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models [50.260693393896716]
Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but constrained by high computational costs. We propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our work pioneers a resource-efficient paradigm for searching for high-speed DMs without sacrificing quality.
arXiv Detail & Related papers (2025-06-03T06:02:50Z) - Optimal Stepsize for Diffusion Sampling [14.849487881523041]
Diffusion models achieve remarkable generation quality but suffer from computationally intensive sampling due to suboptimal step discretization. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval.
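A schedule-extraction step of this flavor can be sketched as a small dynamic program. Here `cost[i][j]` is a hypothetical, externally supplied error for jumping directly from timestep i to j (not the paper's actual distillation objective); the DP selects the `budget`-step schedule minimizing total cost:

```python
def best_schedule(cost, n_steps, budget):
    # dp[i][k]: minimal cost of reaching timestep i using exactly k jumps.
    INF = float("inf")
    dp = [[INF] * (budget + 1) for _ in range(n_steps + 1)]
    parent = [[None] * (budget + 1) for _ in range(n_steps + 1)]
    dp[0][0] = 0.0
    for i in range(n_steps):
        for k in range(budget):
            if dp[i][k] == INF:
                continue
            for j in range(i + 1, n_steps + 1):
                c = dp[i][k] + cost[i][j]
                if c < dp[j][k + 1]:
                    dp[j][k + 1] = c
                    parent[j][k + 1] = i
    # Backtrack the optimal schedule of timesteps.
    sched, i, k = [n_steps], n_steps, budget
    while k > 0:
        i = parent[i][k]
        sched.append(i)
        k -= 1
    return dp[n_steps][budget], sched[::-1]
```

For example, with `cost[i][j] = (j - i)**2` over 4 timesteps and a budget of 2 jumps, the DP picks the evenly spaced schedule [0, 2, 4].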
arXiv Detail & Related papers (2025-03-27T17:59:46Z) - Recurrent Diffusion for Large-Scale Parameter Generation [52.98888368644455]
We introduce Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework that generates full sets of neural network parameters, up to hundreds of millions, on a single GPU. RPG serves as a critical advance in AI generating AI, potentially enabling efficient weight generation at scales previously deemed infeasible.
arXiv Detail & Related papers (2025-01-20T16:46:26Z) - LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers [79.07412045476872]
Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks. We show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. We propose a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations.
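The lazy-reuse idea can be sketched as a cache keyed on input similarity. The relative-tolerance test and the names below are hypothetical stand-ins for the paper's learned skipping decision:

```python
import numpy as np

def lazy_forward(block, x, cache, tol=1e-2):
    # Reuse the cached output when the input has barely changed since the
    # previous diffusion step; otherwise recompute and refresh the cache.
    if cache["x"] is not None and \
            np.linalg.norm(x - cache["x"]) < tol * np.linalg.norm(x):
        return cache["y"], True            # skipped: cached result reused
    y = block(x)
    cache["x"], cache["y"] = x.copy(), y
    return y, False
```

Adjacent diffusion steps tend to produce similar intermediate activations, which is what makes this kind of reuse profitable in practice.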
arXiv Detail & Related papers (2024-12-17T01:12:35Z) - DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization [22.546989373687655]
We propose a novel pruning method that derives an efficient diffusion model via a more intelligent and differentiable pruner.
Our approach achieves a 4.4x speedup for SD-1.5 without any loss of accuracy, significantly outperforming previous state-of-the-art methods.
arXiv Detail & Related papers (2024-10-22T12:18:24Z) - Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and throughput differences. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads. Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays.
arXiv Detail & Related papers (2024-10-08T12:32:36Z) - Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods, at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
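Sketching here means compressing the d-dimensional gradient into k buckets before communication, count-sketch style, which is where a per-round cost on the order of $k \log(d)$ comes from. A minimal sketch/unsketch pair, with hypothetical names and not the paper's exact construction, might look like:

```python
import numpy as np

def count_sketch(grad, k, seed=0):
    # Hash each of the d coordinates into one of k buckets with a random sign.
    rng = np.random.default_rng(seed)
    d = grad.size
    bucket = rng.integers(0, k, size=d)
    sign = rng.choice([-1.0, 1.0], size=d)
    sketch = np.zeros(k)
    np.add.at(sketch, bucket, sign * grad)  # accumulate signed contributions
    return sketch, bucket, sign

def unsketch(sketch, bucket, sign):
    # Estimate each coordinate from its bucket (unbiased under random signs,
    # up to collision noise).
    return sign * sketch[bucket]
```

Each worker would transmit only the k-dimensional `sketch` (the hash functions are shared via a common seed), and the receiver recovers approximate per-coordinate gradients with `unsketch`.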
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - Learning to Efficiently Sample from Diffusion Probabilistic Models [49.58748345998702]
Denoising Diffusion Probabilistic Models (DDPMs) can yield high-fidelity samples and competitive log-likelihoods across a range of domains.
We introduce an exact dynamic programming algorithm that finds the optimal discrete time schedules for any pre-trained DDPM.
arXiv Detail & Related papers (2021-06-07T17:15:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.