Related papers: Analyzing and Improving the Training Dynamics of Diffusion Models

URL: http://arxiv.org/abs/2312.02696v2
Date: Wed, 20 Mar 2024 12:58:14 GMT
Title: Analyzing and Improving the Training Dynamics of Diffusion Models
Authors: Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, Samuli Laine
Abstract summary: We identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity.
Score: 36.37845647984578
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.

Related papers

Rethinking the Role of Dynamic Sparse Training for Scalable Deep Reinforcement Learning [58.533203990515034]
Scaling neural networks has driven breakthrough advances in machine learning, yet this paradigm fails in deep reinforcement learning (DRL). We show that dynamic sparse training strategies provide module-specific benefits that complement the primary scalability foundation established by architectural improvements. We finally distill these insights into Module-Specific Training (MST), a practical framework that exploits the benefits of architectural improvements and demonstrates substantial scalability gains across diverse RL algorithms without algorithmic modifications.
arXiv Detail & Related papers (2025-10-14T03:03:08Z)
Towards the Training of Deeper Predictive Coding Neural Networks [53.15874572081944]
Predictive coding networks trained with equilibrium propagation are neural models that perform inference through an iterative energy process. Previous studies have demonstrated their effectiveness in shallow architectures, but show significant performance degradation when depth exceeds five to seven layers. We show that this degradation is caused by exponentially imbalanced errors between layers during weight updates, and by predictions from the previous layer not being effective in guiding updates in deeper layers.
arXiv Detail & Related papers (2025-06-30T12:44:47Z)
Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning [57.3885832382455]
We show that introducing static network sparsity alone can unlock further scaling potential beyond dense counterparts with state-of-the-art architectures. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity.
arXiv Detail & Related papers (2025-06-20T17:54:24Z)
Auto-Compressing Networks [51.221103189527014]
We introduce Auto-compression Networks (ACNs), an architectural variant where long feedforward connections from each layer replace traditional short residual connections. We show that ACNs exhibit enhanced noise compared to residual networks, superior performance in low-data settings, and mitigate catastrophic forgetting. These findings establish ACNs as a practical approach to developing efficient neural architectures.
arXiv Detail & Related papers (2025-06-11T13:26:09Z)
Physics-informed Multiple-Input Operators for efficient dynamic response prediction of structures [0.07916635054977067]
MIONet predicts structural responses continuously over both space and time. Model is validated on both a simple beam and the KW-51 bridge, achieving FEM level accuracy within seconds.
arXiv Detail & Related papers (2025-05-11T18:45:58Z)
Deep Linear Network Training Dynamics from Random Initialization: Data, Width, Depth, and Hyperparameter Transfer [40.40780546513363]
We provide descriptions of both non-residual and residual neural networks, the latter of which enables an infinite depth limit when branches are scaled as $1/\sqrt{\text{depth}}$. We show that this model recovers the accelerated power law training dynamics for power law structured data in the rich regime observed in recent works.
arXiv Detail & Related papers (2025-02-04T17:50:55Z)
Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network). After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs). Existing binarization methods result in significant performance degradation. We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning. Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation. Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
Enhancing Neural Training via a Correlated Dynamics Model [2.9302545029880394]
Correlation Mode Decomposition (CMD) is an algorithm that clusters the parameter space into groups that display synchronized behavior across epochs. We introduce an efficient CMD variant, designed to run concurrently with training. Our experiments indicate that CMD surpasses the state-of-the-art method for compactly modeled dynamics on image classification.
arXiv Detail & Related papers (2023-12-20T18:22:49Z)
Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process. We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures. We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation [34.7523496790944]
We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering the training dynamics. We show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original budget for training.
arXiv Detail & Related papers (2023-06-22T07:06:45Z)
The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task. We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality. This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
Dual adaptive training of photonic neural networks [30.86507809437016]
Photonic neural network (PNN) computes with photons instead of electrons to feature low latency, high energy efficiency, and high parallelism. Existing training approaches cannot address the extensive accumulation of systematic errors in large-scale PNNs. We propose dual adaptive training (DAT) that allows the PNN model to adapt to substantial systematic errors.
arXiv Detail & Related papers (2022-12-09T05:03:45Z)
Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks [2.580765958706854]
We describe and remedy a failure mode that may arise from multi-scale dynamics with scale imbalances during training of deep neural networks. PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data. For inverse modeling using sequential training, we find that inverse-Dirichlet weighting protects a PINN against catastrophic forgetting.
arXiv Detail & Related papers (2021-07-02T10:01:37Z)
On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time. We find that increasing both the training set and model sizes significantly improves the distributional shift robustness.
arXiv Detail & Related papers (2020-07-16T18:39:04Z)
Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity. Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.