Super Consistency of Neural Network Landscapes and Learning Rate Transfer
- URL: http://arxiv.org/abs/2402.17457v2
- Date: Wed, 13 Nov 2024 00:38:48 GMT
- Title: Super Consistency of Neural Network Landscapes and Learning Rate Transfer
- Authors: Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, Antonio Orvieto
- Abstract summary: We study the landscape through the lens of the loss Hessian.
We find that certain spectral properties under $\mu$P are largely independent of the size of the network.
We show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
- Score: 72.54450821671624
- Abstract: Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters -- such as the learning rate -- exhibit transfer from small to very large models. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is consistently similar across very different model sizes. In this work, we study the landscape through the lens of the loss Hessian, with a focus on its largest eigenvalue (i.e. the sharpness), and find that certain spectral properties under $\mu$P are largely independent of the size of the network, and remain consistent as training progresses. We name this property Super Consistency of the landscape. On the other hand, we show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales. But what causes these differences in the sharpness dynamics? Through a connection between the Hessian's and the NTK's spectrum, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK scaling) of feature learning. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText.
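The abstract defines the sharpness as the largest eigenvalue of the loss Hessian. As a rough illustration of how such a quantity can be estimated without forming the Hessian, the sketch below runs power iteration on finite-difference Hessian-vector products. It is not the authors' code: the toy quadratic `loss`, the helper names `grad`, `hvp`, and `sharpness`, and the step sizes are all assumptions chosen for illustration. Power iteration converges to the eigenvalue of largest magnitude, so it matches the sharpness only when that eigenvalue is positive.

```python
import math
import random


def loss(w):
    # Hypothetical toy loss: a quadratic whose Hessian is diag(1, 4, 9),
    # so the largest Hessian eigenvalue is 9 by construction.
    return 0.5 * (1.0 * w[0] ** 2 + 4.0 * w[1] ** 2 + 9.0 * w[2] ** 2)


def grad(f, w, eps=1e-5):
    # Central finite-difference gradient of f at w.
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((f(wp) - f(wm)) / (2.0 * eps))
    return g


def hvp(f, w, v, eps=1e-3):
    # Hessian-vector product: H v ~ (grad(w + eps v) - grad(w - eps v)) / (2 eps).
    wp = [wi + eps * vi for wi, vi in zip(w, v)]
    wm = [wi - eps * vi for wi, vi in zip(w, v)]
    gp, gm = grad(f, wp), grad(f, wm)
    return [(a - b) / (2.0 * eps) for a, b in zip(gp, gm)]


def sharpness(f, w, iters=100, seed=0):
    # Power iteration: repeatedly apply H to a unit vector and renormalize.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(f, w, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return lam


if __name__ == "__main__":
    print(sharpness(loss, [0.3, -0.2, 0.5]))
```

In practice one would replace the finite differences with automatic differentiation and evaluate the loss on a data batch; the finite-difference version is used here only to keep the sketch dependency-free.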
Related papers
- On Learnable Parameters of Optimal and Suboptimal Deep Learning Models [2.889799048595314]
We study the structural and operational aspects of deep learning models.
Our research focuses on the nuances of learnable parameters (weight) statistics, distribution, node interaction, and visualization.
arXiv Detail & Related papers (2024-08-21T15:50:37Z)
- Beyond Uniform Scaling: Exploring Depth Heterogeneity in Neural Architectures [9.91972450276408]
We introduce an automated scaling approach leveraging second-order loss landscape information.
Our method is flexible towards skip connections, a mainstay in modern vision transformers.
We introduce the first intact scaling mechanism for vision transformers, a step towards efficient model scaling.
arXiv Detail & Related papers (2024-02-19T09:52:45Z)
- Unveiling the Unseen: Identifiable Clusters in Trained Depthwise Convolutional Kernels [56.69755544814834]
Recent advances in depthwise-separable convolutional neural networks (DS-CNNs) have led to novel architectures.
This paper reveals another striking property of DS-CNN architectures: discernible and explainable patterns emerge in their trained depthwise convolutional kernels in all layers.
arXiv Detail & Related papers (2024-01-25T19:05:53Z)
- From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport [32.39176908225668]
We introduce the concept of the non-linearity signature of DNN, the first theoretically sound solution for measuring the non-linearity of deep neural networks.
We provide extensive experimental results that highlight the practical usefulness of the proposed non-linearity signature.
arXiv Detail & Related papers (2023-10-17T17:50:22Z)
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z)
- FuNNscope: Visual microscope for interactively exploring the loss landscape of fully connected neural networks [77.34726150561087]
We show how to explore high-dimensional landscape characteristics of neural networks.
We generalize observations on small neural networks to more complex systems.
An interactive dashboard opens up a number of possible application networks.
arXiv Detail & Related papers (2022-04-09T16:41:53Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)
- Hold me tight! Influence of discriminative features on deep network boundaries [63.627760598441796]
We propose a new perspective that relates dataset features to the distance of samples to the decision boundary.
This enables us to carefully tweak the position of the training samples and measure the induced changes on the boundaries of CNNs trained on large-scale vision datasets.
arXiv Detail & Related papers (2020-02-15T09:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.