ScaleLong: Towards More Stable Training of Diffusion Model via Scaling
Network Long Skip Connection
- URL: http://arxiv.org/abs/2310.13545v1
- Date: Fri, 20 Oct 2023 14:45:52 GMT
- Title: ScaleLong: Towards More Stable Training of Diffusion Model via Scaling
Network Long Skip Connection
- Authors: Zhongzhan Huang, Pan Zhou, Shuicheng Yan, Liang Lin
- Abstract summary: We show that the coefficients of LSCs in UNet strongly affect the stability of forward and backward propagation and the robustness of UNet.
We propose ScaleLong, an effective coefficient scaling framework that scales the coefficients of LSCs in UNet and improves its training stability.
- Score: 152.01257690637064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In diffusion models, UNet is the most popular network backbone, since its
long skip connections (LSCs), which connect distant network blocks, can
aggregate long-range information and alleviate vanishing gradients.
Unfortunately, UNet often suffers from unstable training in diffusion models,
which can be alleviated by scaling its LSC coefficients to smaller values.
However, a theoretical understanding of UNet's instability in diffusion models,
and of the performance improvement brought by LSC scaling, is still lacking. To
address this, we theoretically show that the coefficients of LSCs in UNet
strongly affect the stability of forward and backward propagation and the
robustness of UNet. Specifically, the hidden features and gradients of UNet at
any layer can oscillate, and their oscillation ranges are large, which explains
the instability of UNet training. Moreover, UNet is provably sensitive to
perturbed inputs and predicts outputs distant from the desired ones, yielding an
oscillatory loss and thus oscillatory gradients. We also characterize the
theoretical benefits of scaling the LSC coefficients for the stability of hidden
features and gradients as well as for robustness. Finally, inspired by our
theory, we propose ScaleLong, an effective coefficient scaling framework that
scales the coefficients of LSCs in UNet and improves its training stability.
Experimental results on four famous datasets show that our methods better
stabilize training and yield about 1.5x training acceleration on different
diffusion models with UNet or UViT backbones. Code: https://github.com/sail-sg/ScaleLong
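To make the mechanism concrete, here is a minimal PyTorch sketch of scaling the coefficients of long skip connections before they are merged into the decoder path. The toy block definitions, the class name ToyUNetWithScaledLSC, the geometric coefficients kappa**i, and the value kappa = 0.7 are illustrative assumptions, not the paper's exact configuration; the official repository linked above contains the real implementation.

```python
# Minimal sketch (not the official ScaleLong code): scaling the coefficients of
# UNet long skip connections (LSCs) before merging them into the decoder.
# The toy blocks and the choice kappa = 0.7 are illustrative assumptions.
import torch
import torch.nn as nn


class ToyUNetWithScaledLSC(nn.Module):
    def __init__(self, channels: int = 64, depth: int = 3, kappa: float = 0.7):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU())
            for _ in range(depth)
        )
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.SiLU())
            for _ in range(depth)
        )
        # Constant scaling: the skip at level i is multiplied by kappa**i, so the
        # coefficients decay geometrically with depth (the exact ordering in the
        # paper may differ; this only illustrates the mechanism).
        self.register_buffer("lsc_coeffs", kappa ** torch.arange(depth, dtype=torch.float32))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for enc in self.encoders:                              # encoder path: collect LSC features
            x = enc(x)
            skips.append(x)
        for level in reversed(range(len(self.decoders))):      # decoder path: consume scaled skips
            scaled_skip = self.lsc_coeffs[level] * skips[level]
            x = self.decoders[level](torch.cat([x, scaled_skip], dim=1))
        return x


if __name__ == "__main__":
    model = ToyUNetWithScaledLSC()
    print(model(torch.randn(2, 64, 32, 32)).shape)  # -> torch.Size([2, 64, 32, 32])
```

Setting kappa = 1 recovers the standard unscaled skip connection, i.e. the baseline whose training instability the paper analyzes.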
Related papers
- Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers [5.187307904567701]
We propose a magnitude-preserving design that stabilizes training without normalization layers.
Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation.
We show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by $\sim$12.8%.
arXiv Detail & Related papers (2025-05-25T12:25:50Z)
- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models [21.16132396642158]
Training stability is a persistent challenge in the pre-training of large language models (LLMs).
We propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers.
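As a rough, hedged illustration of what decoupling the scale of a fully-connected weight matrix from its distribution can look like, the sketch below uses a weight-normalization-style reparameterization; the class name ScaleDecoupledLinear and the initialization are assumptions, and this is not SDD's actual formulation.

```python
# Generic sketch: separate a learnable per-output scale from a row-normalized
# weight "distribution" in a fully-connected layer. This is a weight-norm-style
# reparameterization illustrating the idea, not the SDD method itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleDecoupledLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.direction = nn.Parameter(torch.randn(out_features, in_features) / in_features ** 0.5)
        self.scale = nn.Parameter(torch.ones(out_features))   # learned independently of the distribution
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each row so only its direction (distribution) matters,
        # then reintroduce the magnitude through the decoupled scale parameter.
        weight = self.scale.unsqueeze(1) * F.normalize(self.direction, dim=1)
        return F.linear(x, weight, self.bias)
```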
arXiv Detail & Related papers (2025-02-21T14:49:34Z)
- Hyperspherical Normalization for Scalable Deep Reinforcement Learning [57.016639036237315]
SimbaV2 is a novel reinforcement learning architecture designed to stabilize optimization.
It scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks.
arXiv Detail & Related papers (2025-02-21T08:17:24Z)
- Improved Training Technique for Latent Consistency Models [18.617862678160243]
Consistency models are capable of producing high-quality samples in either a single step or multiple steps.
We analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers.
We introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance.
arXiv Detail & Related papers (2025-02-03T15:25:58Z)
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper analyzes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z)
- DelGrad: Exact event-based gradients for training delays and weights on spiking neuromorphic hardware [1.5226147562426895]
Spiking neural networks (SNNs) inherently rely on the timing of signals for representing and processing information.
We propose DelGrad, an event-based method to compute exact loss gradients for both synaptic weights and delays.
We experimentally demonstrate the memory efficiency and accuracy benefits of adding delays to SNNs on noisy mixed-signal hardware.
arXiv Detail & Related papers (2024-04-30T00:02:34Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Gradient Reweighting: Towards Imbalanced Class-Incremental Learning [8.438092346233054]
Class-Incremental Learning (CIL) trains a model to continually recognize new classes from non-stationary data.
A major challenge of CIL arises when it is applied to real-world data with non-uniform distributions.
We show that this dual imbalance issue causes skewed gradient updates with biased weights in FC layers, thus inducing over/under-fitting and catastrophic forgetting in CIL.
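A common way to counteract such skewed FC-layer gradients is to reweight the loss inversely to class frequency; the generic sketch below illustrates that idea. The helper name inverse_frequency_ce and the weighting formula are assumptions, not the reweighting scheme proposed in the paper.

```python
# Generic inverse-frequency loss reweighting for imbalanced classification,
# illustrating how skewed per-class gradient contributions can be rebalanced.
# This is not the paper's gradient-reweighting scheme.
import torch
import torch.nn.functional as F


def inverse_frequency_ce(logits: torch.Tensor, targets: torch.Tensor,
                         class_counts: torch.Tensor) -> torch.Tensor:
    # Rare classes get proportionally larger weights so their gradients are not
    # drowned out by frequent classes in the FC layer.
    counts = class_counts.clamp(min=1).float()
    weights = counts.sum() / (counts.numel() * counts)
    return F.cross_entropy(logits, targets, weight=weights)


# Example: 3 classes observed 1000, 100 and 10 times.
loss = inverse_frequency_ce(torch.randn(8, 3), torch.randint(0, 3, (8,)),
                            torch.tensor([1000, 100, 10]))
```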
arXiv Detail & Related papers (2024-02-28T18:08:03Z)
- Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning [0.32634122554914]
Gradient sparsification is a potential optimization approach to reduce the communication volume without significant loss of model fidelity.
Existing gradient sparsification methods have low scalability owing to inefficient design of their algorithms.
We propose a novel gradient sparsification scheme called ExDyna to address these challenges.
In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance.
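For context on what gradient sparsification does in general, below is a minimal top-k sparsifier with local error feedback; it illustrates the communication-reduction idea, not ExDyna's specific partitioning and threshold selection. The function name and the k_ratio default are assumptions.

```python
# Minimal top-k gradient sparsification with error feedback: only the k
# largest-magnitude entries are communicated; the dropped remainder is kept
# locally and added back at the next step. A generic sketch, not ExDyna.
import torch


def topk_sparsify(grad: torch.Tensor, residual: torch.Tensor, k_ratio: float = 0.01):
    accumulated = (grad + residual).flatten()
    k = max(1, int(k_ratio * accumulated.numel()))
    _, idx = torch.topk(accumulated.abs(), k)
    sparse = torch.zeros_like(accumulated)
    sparse[idx] = accumulated[idx]
    new_residual = (accumulated - sparse).view_as(grad)  # carried to the next iteration
    return sparse.view_as(grad), new_residual


# Example usage on a dummy gradient tensor.
g = torch.randn(4, 256)
sparse_g, res = topk_sparsify(g, torch.zeros_like(g), k_ratio=0.05)
```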
arXiv Detail & Related papers (2024-02-21T13:00:44Z)
- Low-Light Image Enhancement with Wavelet-based Diffusion Models [50.632343822790006]
Diffusion models have achieved promising results in image restoration tasks, yet they suffer from time-consuming sampling, excessive computational resource consumption, and unstable restoration.
We propose a robust and efficient Diffusion-based Low-Light image enhancement approach, dubbed DiffLL.
arXiv Detail & Related papers (2023-06-01T03:08:28Z)
- Recycling Model Updates in Federated Learning: Are Gradient Subspaces Low-Rank? [26.055358499719027]
We propose the "Look-back Gradient Multiplier" (LBGM) algorithm, which exploits the low-rank property of gradient subspaces to enable gradient recycling.
We analytically characterize the convergence behavior of LBGM, revealing the nature of the trade-off between communication savings and model performance.
We show that LBGM is a general plug-and-play algorithm that can be used standalone or stacked on top of existing sparsification techniques for distributed model training.
arXiv Detail & Related papers (2022-02-01T09:05:32Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Federated Stochastic Gradient Langevin Dynamics [12.180900849847252]
Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling.
We propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates.
We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where DSGLD fails.
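For reference, the standard SGLD update that distributed variants such as DSGLD build on is sketched below; the conducive-gradient correction proposed in the paper is not reproduced, and the function name is an assumption.

```python
# Standard SGLD update step: a gradient step on the (stochastic) negative log
# posterior plus Gaussian noise whose variance matches the step size, so the
# iterates sample from the posterior instead of collapsing to a mode.
import torch


def sgld_step(theta: torch.Tensor, stoch_grad_neg_log_post: torch.Tensor,
              step_size: float) -> torch.Tensor:
    noise = torch.randn_like(theta) * (step_size ** 0.5)
    return theta - 0.5 * step_size * stoch_grad_neg_log_post + noise
```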
arXiv Detail & Related papers (2020-04-23T15:25:09Z)
- Stabilizing Training of Generative Adversarial Nets via Langevin Stein Variational Gradient Descent [11.329376606876101]
We propose to stabilize GAN training via a novel particle-based variational inference method, Langevin Stein variational gradient descent (LSVGD).
We show that the LSVGD dynamics has an implicit regularization that enhances particle spread and diversity.
arXiv Detail & Related papers (2020-04-22T11:20:04Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of functions beyond neural networks.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.