Dissecting adaptive methods in GANs
- URL: http://arxiv.org/abs/2210.04319v1
- Date: Sun, 9 Oct 2022 19:00:07 GMT
- Title: Dissecting adaptive methods in GANs
- Authors: Samy Jelassi, David Dobre, Arthur Mensch, Yuanzhi Li, Gauthier Gidel
- Abstract summary: We study how adaptive methods help train generative adversarial networks (GANs)
By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training.
We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse.
- Score: 46.90376306847234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adaptive methods are a crucial component widely used for training generative
adversarial networks (GANs). While there has been some work to pinpoint the
"marginal value of adaptive methods" in standard tasks, it remains unclear why
they are still critical for GAN training. In this paper, we formally study how
adaptive methods help train GANs; inspired by the grafting method proposed in
arXiv:2002.11803 [cs.LG], we separate the magnitude and direction components of
the Adam updates, and graft them to the direction and magnitude of SGDA updates
respectively. By considering an update rule with the magnitude of the Adam
update and the normalized direction of SGD, we empirically show that the
adaptive magnitude of Adam is key for GAN training. This motivates us to have a
closer look at the class of normalized stochastic gradient descent ascent
(nSGDA) methods in the context of GAN training. We propose a synthetic
theoretical framework to compare the performance of nSGDA and SGDA for GAN
training with neural networks. We prove that in that setting, GANs trained with
nSGDA recover all the modes of the true distribution, whereas the same networks
trained with SGDA (and any learning rate configuration) suffer from mode
collapse. The critical insight in our analysis is that normalizing the
gradients forces the discriminator and generator to be updated at the same
pace. We also experimentally show that for several datasets, Adam's performance
can be recovered with nSGDA methods.
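For concreteness, below is a minimal sketch of the two update rules discussed in the abstract: the grafted update that pairs the magnitude of the Adam step with the normalized SGD direction, and the nSGDA update that normalizes both players' gradients. The NumPy formulation, the whole-vector normalization, and the helper names are illustrative assumptions, not the authors' implementation (the abstract does not specify the grafting granularity, e.g., global vs. per layer).

```python
# Minimal sketch (not the authors' code) of the grafted update and of nSGDA,
# written with NumPy for generic parameter vectors. Normalization granularity
# and hyperparameters are assumptions for illustration.
import numpy as np

EPS = 1e-8


def adam_step(grad, m, v, lr, beta1=0.9, beta2=0.999, t=1):
    """One Adam update; returns (update, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (np.sqrt(v_hat) + EPS)
    return update, m, v


def grafted_step(grad, m, v, lr, t=1):
    """Grafting: magnitude of the Adam update, direction of the normalized SGD update."""
    adam_update, m, v = adam_step(grad, m, v, lr, t=t)
    direction = grad / (np.linalg.norm(grad) + EPS)    # normalized SGD direction
    update = np.linalg.norm(adam_update) * direction   # Adam magnitude
    return update, m, v


def nsgda_step(params_g, params_d, grad_g, grad_d, lr_g, lr_d):
    """Normalized SGDA: each player takes a step whose length is set by its
    learning rate alone, so generator and discriminator move at the same pace."""
    params_g = params_g - lr_g * grad_g / (np.linalg.norm(grad_g) + EPS)  # descent
    params_d = params_d + lr_d * grad_d / (np.linalg.norm(grad_d) + EPS)  # ascent
    return params_g, params_d
```

The point visible in `nsgda_step` is the one the analysis rests on: because each gradient is normalized, the step length of each player is fixed by its learning rate, which is the sense in which normalization forces the discriminator and generator to be updated at the same pace.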
Related papers
- Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates [3.6185342807265415]
Deep learning algorithms are the key ingredients in many artificial intelligence (AI) systems.
Deep learning algorithms typically consist of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method.
arXiv Detail & Related papers (2024-07-11T00:10:35Z)
- DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design [11.922951794283168]
In this work, we investigate how the sampling of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents.
We discover that for deep actor-critic architectures sharing their base layers, prioritising levels according to their value loss minimises the mutual information between the agent's internal representation and the set of training levels in the generated training data.
We find that existing UED methods can significantly shift the training distribution, which translates to low ZSG performance.
To prevent both overfitting and distributional shift, we introduce data-regularised environment design (DRED).
arXiv Detail & Related papers (2024-02-05T19:47:45Z)
- Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods [11.895321856533934]
Stochastic gradient descent (SGD) and adaptive gradient methods have been widely used in training deep neural networks.
We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations.
arXiv Detail & Related papers (2023-08-13T07:03:22Z)
- PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion [66.95761172711073]
The generalization of neural networks is a central challenge in machine learning.
We propose to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data.
We put this theoretical framework into practice as $\textbf{PDE}+$ ($\textbf{PDE}$ with $\textbf{A}$daptive $\textbf{D}$istributional $\textbf{D}$iffusion).
arXiv Detail & Related papers (2023-05-25T08:23:26Z)
- Local Convergence of Gradient Descent-Ascent for Training Generative Adversarial Networks [20.362912591032636]
We study the local dynamics of gradient descent-ascent (GDA) for training a GAN with a kernel-based discriminator.
We show phase transitions that indicate when the system converges, oscillates, or diverges.
arXiv Detail & Related papers (2023-05-14T23:23:08Z)
- LD-GAN: Low-Dimensional Generative Adversarial Network for Spectral Image Generation with Variance Regularization [72.4394510913927]
Deep learning methods are state-of-the-art for spectral image (SI) computational tasks.
GANs enable diverse augmentation by learning and sampling from the data distribution.
GAN-based SI generation is challenging since the high-dimensional nature of this kind of data hinders the convergence of GAN training, yielding suboptimal generation.
We propose a statistical regularization to control the low-dimensional representation variance for the autoencoder training and to achieve high diversity of samples generated with the GAN.
arXiv Detail & Related papers (2023-04-29T00:25:02Z)
- Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale invariant version of BERT, called SIBERT, which, when trained simply by vanilla SGD, achieves performance comparable to BERT trained with adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.