What training reveals about neural network complexity
- URL: http://arxiv.org/abs/2106.04186v1
- Date: Tue, 8 Jun 2021 08:58:00 GMT
- Title: What training reveals about neural network complexity
- Authors: Andreas Loukas, Marinos Poiitis, Stefanie Jegelka
- Abstract summary: This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
- Score: 80.87515604428346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work explores the hypothesis that the complexity of the function a deep
neural network (NN) is learning can be deduced by how fast its weights change
during training. Our analysis provides evidence for this supposition by
relating the network's distribution of Lipschitz constants (i.e., the norm of
the gradient at different regions of the input space) during different training
intervals with the behavior of the stochastic training procedure. We first
observe that the average Lipschitz constant close to the training data affects
various aspects of the parameter trajectory, with more complex networks having
a longer trajectory, bigger variance, and often veering further from their
initialization. We then show that NNs whose biases are trained more steadily
have bounded complexity even in regions of the input space that are far from
any training point. Finally, we find that steady training with Dropout implies
a training- and data-dependent generalization bound that grows
poly-logarithmically with the number of parameters. Overall, our results
support the hypothesis that good training behavior can be a useful bias towards
good generalization.
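
The quantities this abstract works with are straightforward to instrument. Below is a minimal PyTorch sketch (not the authors' code; the model, data, and hyperparameters are illustrative placeholders) that estimates the local Lipschitz constant near training points as the norm of the network's input gradient, and tracks the length of the SGD parameter trajectory and its distance from initialization, i.e., the statistics the abstract relates to function complexity.

```python
# Minimal sketch, assuming a scalar-output regression model (placeholders throughout).
import torch
import torch.nn as nn

def input_grad_norms(model, x):
    """Per-sample norm of the gradient of the output w.r.t. the input,
    a local estimate of the Lipschitz constant around each point."""
    x = x.clone().requires_grad_(True)
    out = model(x).sum()                     # samples are independent, so the sum's
    (grad,) = torch.autograd.grad(out, x)    # input gradient is per-sample
    return grad.flatten(1).norm(dim=1)

def flat_params(model):
    return torch.cat([p.detach().flatten() for p in model.parameters()])

# toy setup (not from the paper)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
x_train = torch.randn(256, 10)
y_train = torch.randn(256, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

theta_0 = flat_params(model)
prev, traj_len = theta_0, 0.0
for step in range(200):
    loss = nn.functional.mse_loss(model(x_train), y_train)
    opt.zero_grad(); loss.backward(); opt.step()
    cur = flat_params(model)
    traj_len += (cur - prev).norm().item()   # length of the parameter trajectory
    prev = cur

lip = input_grad_norms(model, x_train)
print(f"mean local Lipschitz estimate near training data: {lip.mean():.3f}")
print(f"trajectory length: {traj_len:.2f}, "
      f"distance from init: {(prev - theta_0).norm().item():.2f}")
```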
Related papers
- Bifurcations and loss jumps in RNN training [7.937801286897863]
We introduce a novel algorithm for detecting all fixed points and k-cycles in ReLU-based RNNs, along with their existence and stability regions.
Our algorithm provides exact results and returns fixed points and cycles up to high orders with surprisingly good scaling behavior; a brute-force illustration of the underlying piecewise-linear structure appears after this list.
arXiv Detail & Related papers (2023-10-26T16:49:44Z)
- Learning a Neuron by a Shallow ReLU Network: Dynamics and Implicit Bias for Correlated Inputs [5.7166378791349315]
We prove that, for the fundamental regression task of learning a single neuron, training a one-hidden-layer ReLU network converges to zero loss.
We also show and characterise a surprising distinction in this setting between interpolator networks of minimal rank and those of minimal Euclidean norm.
arXiv Detail & Related papers (2023-06-10T16:36:22Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, and only later in training exploit higher-order statistics.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning; a toy sketch of how such a bias can be probed appears after this list.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- On the (Non-)Robustness of Two-Layer Neural Networks in Different Learning Regimes [27.156666384752548]
Neural networks are highly sensitive to adversarial examples.
We study robustness and generalization in different learning regimes.
We show how linearized lazy training regimes can worsen robustness.
arXiv Detail & Related papers (2022-03-22T16:40:52Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- More data or more parameters? Investigating the effect of data structure on generalization [17.249712222764085]
Properties of the data impact the test error as a function of the number of training examples and the number of parameters.
We show that noise in the labels and strong anisotropy of the input data have similar effects on the test error.
arXiv Detail & Related papers (2021-03-09T16:08:41Z)
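
For the "Bifurcations and loss jumps in RNN training" entry above, here is a minimal brute-force sketch (not the paper's algorithm, which scales far better) of why fixed points of a ReLU RNN are tractable: within each activation pattern the update is affine, so candidate fixed points solve a linear system and only need a consistency check. All names and sizes here are illustrative.

```python
# Brute-force sketch for a tiny autonomous ReLU RNN h' = relu(W h + b):
# enumerate activation patterns; in each region solve (I - D W) h = D b.
import itertools
import numpy as np

def relu_rnn_fixed_points(W, b, tol=1e-9):
    n = len(b)
    fixed_points = []
    for pattern in itertools.product([0.0, 1.0], repeat=n):
        D = np.diag(pattern)                    # active units in this region
        A = np.eye(n) - D @ W
        try:
            h = np.linalg.solve(A, D @ b)       # candidate fixed point
        except np.linalg.LinAlgError:
            continue                            # singular region map, skip
        pre = W @ h + b
        # accept only if the candidate actually lies in the region `pattern`
        if all((pre[i] > -tol) == bool(pattern[i]) for i in range(n)):
            fixed_points.append(h)
    return fixed_points

rng = np.random.default_rng(0)
W = rng.normal(scale=1.5, size=(4, 4))
b = rng.normal(size=4)
for h in relu_rnn_fixed_points(W, b):
    print(np.round(h, 3))
```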
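
And for the "Neural networks trained with SGD learn distributions of increasing complexity" entry, a toy sketch (not the paper's experiments) of how the distributional simplicity bias can be probed: track how often the network's predictions agree with a classifier that uses only first-order statistics (nearest class mean) as training proceeds. The data construction, architecture, and hyperparameters are all illustrative, and how cleanly the effect shows up depends on them.

```python
# Illustrative only: two classes whose means differ along one coordinate but
# whose covariances differ everywhere, so class means give a decent but
# imperfect classifier and higher-order statistics add accuracy.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 30, 2000
m = torch.zeros(d); m[0] = 1.0
x = torch.cat([torch.randn(n // 2, d) * torch.linspace(0.5, 1.5, d) + m,
               torch.randn(n // 2, d) * torch.linspace(1.5, 0.5, d) - m])
y = torch.cat([torch.zeros(n // 2), torch.ones(n // 2)]).long()

mu = torch.stack([x[y == c].mean(0) for c in (0, 1)])   # class means
mean_pred = torch.cdist(x, mu).argmin(dim=1)            # first-order classifier

net = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.05)
for step in range(1, 1001):
    loss = nn.functional.cross_entropy(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 200 == 0:
        with torch.no_grad():
            pred = net(x).argmax(dim=1)
        agree = (pred == mean_pred).float().mean()
        acc = (pred == y).float().mean()
        # under the DSB hypothesis, agreement is highest early in training
        print(f"step {step:4d}  agreement with mean classifier {agree:.2f}  accuracy {acc:.2f}")
```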