Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning
- URL: http://arxiv.org/abs/2402.17457v1
- Date: Tue, 27 Feb 2024 12:28:01 GMT
- Title: Why do Learning Rates Transfer? Reconciling Optimization and Scaling
Limits for Deep Learning
- Authors: Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, Antonio Orvieto
- Abstract summary: We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
- Score: 77.82908213345864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been growing evidence that if the width and depth of a
neural network are scaled toward the so-called rich feature learning limit
($\mu$P and its depth extension), then some hyperparameters - such as the
learning rate - exhibit transfer from small to very large models, thus reducing
the cost of hyperparameter tuning. From an optimization perspective, this
phenomenon is puzzling, as it implies that the loss landscape is remarkably
consistent across very different model sizes. In this work, we find empirical
evidence that learning rate transfer can be attributed to the fact that under
$\mu$P and its depth extension, the largest eigenvalue of the training loss
Hessian (i.e. the sharpness) is largely independent of the width and depth of
the network for a sustained period of training time. On the other hand, we show
that under the neural tangent kernel (NTK) regime, the sharpness exhibits very
different dynamics at different scales, thus preventing learning rate transfer.
But what causes these differences in the sharpness dynamics? Through a
connection between the spectra of the Hessian and the NTK matrix, we argue that
the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK
regime) of feature learning, which results in a different evolution of the NTK,
and thus of the sharpness. We corroborate our claims with a substantial suite
of experiments, covering a wide range of datasets and architectures: from
ResNets and Vision Transformers trained on benchmark vision datasets to
Transformer-based language models trained on WikiText.
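The quantity at the center of this claim, the sharpness, can be estimated with standard tools. The sketch below is a minimal, self-contained illustration: it estimates the top eigenvalue of the training-loss Hessian by power iteration on Hessian-vector products, using a small placeholder MLP and synthetic data rather than the paper's ResNet/ViT/Transformer setups, and it evaluates the sharpness only at initialization rather than along a training run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and data: the paper works with ResNets, ViTs and
# Transformers; a small MLP on synthetic data keeps the sketch self-contained.
d_in, width, n_samples = 20, 256, 512
model = nn.Sequential(
    nn.Linear(d_in, width), nn.Tanh(),
    nn.Linear(width, width), nn.Tanh(),
    nn.Linear(width, 1),
)
X = torch.randn(n_samples, d_in)
y = torch.randn(n_samples, 1)
loss_fn = nn.MSELoss()
params = [p for p in model.parameters() if p.requires_grad]
n_params = sum(p.numel() for p in params)


def hvp(v):
    """Hessian-vector product of the training loss via double backprop."""
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    Hv = torch.autograd.grad(flat_grad @ v, params)
    return torch.cat([h.reshape(-1) for h in Hv])


def sharpness(n_iters=100, tol=1e-4):
    """Estimate the top Hessian eigenvalue (the sharpness) by power iteration.

    Power iteration returns the eigenvalue of largest magnitude, which for a
    typical training-loss Hessian is the largest positive one.
    """
    v = torch.randn(n_params)
    v /= v.norm()
    eig = 0.0
    for _ in range(n_iters):
        Hv = hvp(v)
        new_eig = torch.dot(v, Hv).item()  # Rayleigh quotient
        v = Hv / (Hv.norm() + 1e-12)
        if abs(new_eig - eig) < tol * max(abs(eig), 1.0):
            break
        eig = new_eig
    return eig


print(f"estimated sharpness (top Hessian eigenvalue): {sharpness():.4f}")
```

To connect with the abstract above, one would track such an estimate over training for $\mu$P-parameterized models of increasing width (and depth, with the depth extension) and check that it stays largely scale-independent, whereas under the NTK parameterization the paper reports very different sharpness dynamics across scales.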
Related papers
- Plastic Learning with Deep Fourier Features [42.41137083374963]
We identify underlying principles that lead to plastic algorithms.
In particular, we provide theoretical results showing that linear function approximation, as well as a special case of deep linear networks, do not suffer from loss of plasticity.
Deep networks composed entirely of deep Fourier features are highly trainable and sustain their trainability over the course of learning.
arXiv Detail & Related papers (2024-10-27T23:38:06Z) - The lazy (NTK) and rich ($μ$P) regimes: a gentle tutorial [0.0]
A central theme of the modern machine learning paradigm is that larger neural networks achieve better performance on a variety of metrics.
In this tutorial, we provide a nonrigorous but illustrative derivation of the following fact: in order to train wide networks effectively, there is only one degree of freedom.
This degree of freedom controls the richness of training behavior: at minimum, the wide network trains lazily like a kernel machine, and at maximum, it exhibits feature learning (a toy code sketch of this lazy-to-rich dichotomy follows this list).
arXiv Detail & Related papers (2024-04-30T17:11:12Z) - Assessing Neural Network Representations During Training Using
Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Frozen Overparameterization: A Double Descent Perspective on Transfer
Learning of Deep Neural Networks [27.17697714584768]
We study the generalization behavior of transfer learning of deep neural networks (DNNs).
We show that the test error evolution during the target training has a more significant double descent effect when the target training dataset is sufficiently large.
Also, we show that the double descent phenomenon may make a transfer from a less related source task better than a transfer from a more related source task.
arXiv Detail & Related papers (2022-11-20T20:26:23Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)