Gradient Starvation: A Learning Proclivity in Neural Networks
- URL: http://arxiv.org/abs/2011.09468v4
- Date: Wed, 24 Nov 2021 18:24:37 GMT
- Title: Gradient Starvation: A Learning Proclivity in Neural Networks
- Authors: Mohammad Pezeshki, Sékou-Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, Guillaume Lajoie
- Abstract summary: Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
- Score: 97.02382916372594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We identify and formalize a fundamental gradient descent phenomenon resulting
in a learning proclivity in over-parameterized neural networks. Gradient
Starvation arises when cross-entropy loss is minimized by capturing only a
subset of features relevant for the task, despite the presence of other
predictive features that fail to be discovered. This work provides a
theoretical explanation for the emergence of such feature imbalance in neural
networks. Using tools from Dynamical Systems theory, we identify simple
properties of learning dynamics during gradient descent that lead to this
imbalance, and prove that such a situation can be expected given certain
statistical structure in training data. Based on our proposed formalism, we
develop guarantees for a novel regularization method aimed at decoupling
feature learning dynamics, improving accuracy and robustness in cases hindered
by gradient starvation. We illustrate our findings with simple and real-world
out-of-distribution (OOD) generalization experiments.
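The regularization method developed in the paper is called Spectral Decoupling there: per the paper's formulation, it replaces standard weight decay with an L2 penalty on the network's logits, which decouples the learning dynamics of individual features. A minimal PyTorch-style sketch (the function name and the coefficient lam = 1e-2 are illustrative assumptions, not values from the paper):

    import torch
    import torch.nn.functional as F

    def spectral_decoupling_loss(logits, targets, lam=1e-2):
        # Standard cross-entropy term.
        ce = F.cross_entropy(logits, targets)
        # Penalize the squared logits rather than the weights; this is
        # the decoupling term intended to counteract gradient starvation.
        penalty = 0.5 * lam * (logits ** 2).mean()
        return ce + penalty

In training this penalty stands in for, rather than supplements, weight decay, e.g. loss = spectral_decoupling_loss(model(x), y).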
Related papers
- Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of deep learning's surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z)
- Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize [5.642322814965062]
Learning representations that generalize under distribution shifts is critical for building robust machine learning models.
We show that even allowing a neural network to explicitly fit the representations obtained from a teacher network that can generalize out-of-distribution is insufficient for the generalization of the student network.
arXiv Detail & Related papers (2024-06-05T15:04:27Z)
- Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse [19.279084204631204]
We extend the investigation of Neural Collapse to biased datasets with imbalanced attributes.
We propose an avoid-shortcut learning framework without additional training complexity.
With well-designed shortcut primes based on Neural Collapse structure, the models are encouraged to skip the pursuit of simple shortcuts.
arXiv Detail & Related papers (2024-05-09T07:23:37Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
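For context, the unhinged loss as commonly defined in the literature (van Rooyen et al., 2015) is linear in the margin, l(y, f(x)) = 1 - y f(x); its gradient is constant in the prediction, which is what makes closed-form training dynamics tractable.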
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation.
We show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Path classification by stochastic linear recurrent neural networks [2.5499055723658097]
We show that RNNs retain a partial signature of the paths they are fed as the sole information exploited for training and classification tasks.
We argue that these RNNs are easy to train and robust, and we back these observations with numerical experiments on both synthetic and real data.
arXiv Detail & Related papers (2021-08-06T12:59:12Z)
- Untangling tradeoffs between recurrence and self-attention in neural networks [81.30894993852813]
We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
arXiv Detail & Related papers (2020-06-16T19:24:25Z)