Analyzing and Improving the Training Dynamics of Diffusion Models
- URL: http://arxiv.org/abs/2312.02696v2
- Date: Wed, 20 Mar 2024 12:58:14 GMT
- Title: Analyzing and Improving the Training Dynamics of Diffusion Models
- Authors: Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, Samuli Laine
- Abstract summary: We identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture.
We find that systematically redesigning the network layers to preserve activation, weight, and update magnitudes on expectation eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity.
- Score: 36.37845647984578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
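To make the notion of "EMA length" in the abstract concrete, the following is a minimal sketch of a conventional exponential moving average of model weights. It is not the paper's post-hoc method. The function names and the decay parameter `beta` are assumptions chosen for illustration. In this conventional form `beta` is fixed before the run starts, so comparing several EMA lengths normally means repeating the run, which is the cost the abstract refers to.

```python
def ema_update(ema_weights, weights, beta):
    """Return beta * ema + (1 - beta) * weights, elementwise."""
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema_weights, weights)]


def run_ema(snapshots, beta):
    """Average a sequence of weight vectors, starting from the first one."""
    ema = list(snapshots[0])
    for weights in snapshots[1:]:
        ema = ema_update(ema, weights, beta)
    return ema


# Toy "training run" (hypothetical data): one weight jumps from 0.0 to 1.0
# and then stays there for ten steps.
snapshots = [[0.0]] + [[1.0]] * 10

short_ema = run_ema(snapshots, beta=0.5)  # short EMA: follows recent weights closely
long_ema = run_ema(snapshots, beta=0.9)   # long EMA: still influenced by the old value
```

A larger `beta` corresponds to a longer average: after the same ten steps, `long_ema` lags further behind the current weight than `short_ema` does.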
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs).
Existing binarization methods result in significant performance degradation.
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- Enhancing Neural Training via a Correlated Dynamics Model [2.9302545029880394]
Correlation Mode Decomposition (CMD) is an algorithm that clusters the parameter space into groups that display synchronized behavior across epochs.
We introduce an efficient CMD variant, designed to run concurrently with training.
Our experiments indicate that CMD surpasses the state-of-the-art method for compactly modeled dynamics on image classification.
arXiv Detail & Related papers (2023-12-20T18:22:49Z)
- Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
- Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation [34.7523496790944]
We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering the training dynamics.
We show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original budget for training.
arXiv Detail & Related papers (2023-06-22T07:06:45Z)
- The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
- Dual adaptive training of photonic neural networks [30.86507809437016]
Photonic neural networks (PNNs) compute with photons instead of electrons to achieve low latency, high energy efficiency, and high parallelism.
Existing training approaches cannot address the extensive accumulation of systematic errors in large-scale PNNs.
We propose dual adaptive training (DAT) that allows the PNN model to adapt to substantial systematic errors.
arXiv Detail & Related papers (2022-12-09T05:03:45Z)
- Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks [2.580765958706854]
We describe and remedy a failure mode that may arise from multi-scale dynamics with scale imbalances during training of deep neural networks.
PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data.
For inverse modeling using sequential training, we find that inverse-Dirichlet weighting protects a PINN against catastrophic forgetting.
arXiv Detail & Related papers (2021-07-02T10:01:37Z)
- On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts.
We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time.
We find that increasing both the training set and model sizes significantly improves the distributional shift robustness.
arXiv Detail & Related papers (2020-07-16T18:39:04Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.