A Random Matrix Theory Approach to Damping in Deep Learning
- URL: http://arxiv.org/abs/2011.08181v5
- Date: Wed, 16 Mar 2022 15:02:53 GMT
- Title: A Random Matrix Theory Approach to Damping in Deep Learning
- Authors: Diego Granziol, Nicholas Baskerville
- Abstract summary: We conjecture that the inherent difference in generalisation between adaptive and non-adaptive gradient methods in deep learning stems from the increased estimation noise.
We develop a novel random matrix theory based damping learner for second-order optimisers, inspired by linear shrinkage estimation.
- Score: 0.7614628596146599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We conjecture that the inherent difference in generalisation between adaptive
and non-adaptive gradient methods in deep learning stems from the increased
estimation noise in the flattest directions of the true loss surface. We
demonstrate that typical schedules used for adaptive methods (with low
numerical stability or damping constants) serve to bias movement towards
flat directions relative to sharp directions, effectively amplifying
the noise-to-signal ratio and harming generalisation. We further demonstrate
that the numerical damping constant used in these methods can be decomposed
into a learning rate reduction and linear shrinkage of the estimated curvature
matrix. We then demonstrate significant generalisation improvements by
increasing the shrinkage coefficient, closing the generalisation gap entirely
in both logistic regression and several deep neural network experiments.
Extending this line of work further, we develop a novel random matrix theory
based damping learner for second-order optimisers, inspired by linear shrinkage
estimation. We experimentally demonstrate our learner to be highly insensitive
to its initial value and to allow for extremely fast convergence in
conjunction with continued stable training and competitive generalisation.
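The decomposition stated in the abstract can be made concrete. For a damped second-order update $-\eta\,(H + \delta I)^{-1}\nabla L$ (a standard form we assume here; the paper's curvature estimate may be a Hessian, Gauss-Newton or Fisher approximation), simple algebra gives $(H + \delta I)^{-1} = \frac{1}{1+\delta}\big[(1-\alpha)H + \alpha I\big]^{-1}$ with $\alpha = \delta/(1+\delta)$, i.e. a learning-rate reduction by $1/(1+\delta)$ combined with linear shrinkage of $H$ towards the identity with coefficient $\alpha$. The sketch below is a minimal numerical check of this identity; the variable names and toy matrices are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy curvature estimate (symmetric positive semi-definite) and gradient.
A = rng.standard_normal((5, 5))
H = A @ A.T / 5.0
g = rng.standard_normal(5)

eta, delta = 0.1, 0.01                      # learning rate and damping constant

# Standard damped second-order step: -eta * (H + delta*I)^{-1} g
step_damped = -eta * np.linalg.solve(H + delta * np.eye(5), g)

# Equivalent view: reduced learning rate + linear shrinkage of H towards I.
alpha = delta / (1.0 + delta)               # implied shrinkage coefficient
H_shrunk = (1.0 - alpha) * H + alpha * np.eye(5)
step_shrunk = -(eta / (1.0 + delta)) * np.linalg.solve(H_shrunk, g)

assert np.allclose(step_damped, step_shrunk)
```

Increasing this shrinkage coefficient beyond the value implied by the default numerical stability constant is the intervention the abstract reports as closing the generalisation gap.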
Related papers
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T) / T^{1 - \frac{1}{\alpha}}\big)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Hebbian learning inspired estimation of the linear regression parameters
from queries [18.374824005225186]
We study a variation of this Hebbian learning rule to recover the regression vector in the linear regression model.
We prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
arXiv Detail & Related papers (2023-09-26T19:00:32Z) - Iterative regularization in classification via hinge loss diagonal descent [12.684351703991965]
Iterative regularization is a classic idea in regularization theory that has recently become popular in machine learning.
In this paper, we focus on iterative regularization in the context of classification.
arXiv Detail & Related papers (2022-12-24T07:15:26Z) - On the generalization of learning algorithms that do not converge [54.122745736433856]
Generalization analyses of deep learning typically assume that the training converges to a fixed point.
Recent results indicate that in practice, the weights of deep neural networks optimized with gradient descent often oscillate indefinitely.
arXiv Detail & Related papers (2022-08-16T21:22:34Z) - Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of
Stochasticity [24.428843425522107]
We study the dynamics of gradient descent over diagonal linear networks through its continuous time version, namely gradient flow.
We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
arXiv Detail & Related papers (2021-06-17T14:16:04Z) - From inexact optimization to learning via gradient concentration [22.152317081922437]
In this paper, we investigate the implicit regularization phenomenon in the context of linear models with smooth loss functions.
We propose a proof technique combining ideas from inexact optimization and probability theory, specifically gradient concentration.
arXiv Detail & Related papers (2021-06-09T21:23:29Z) - Learning Rates as a Function of Batch Size: A Random Matrix Theory
Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal learning rates for both stochastic gradient descent and adaptive training regimens for smooth, non-convex deep neural networks.
We validate our claims using VGG/ResNet architectures on the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z) - Disentangling Adaptive Gradient Methods from Learning Rates [65.0397050979662]
We take a deeper look at how adaptive gradient methods interact with the learning rate schedule.
We introduce a "grafting" experiment which decouples an update's magnitude from its direction.
We present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods.
arXiv Detail & Related papers (2020-02-26T21:42:49Z)