Gradient Starvation: A Learning Proclivity in Neural Networks
- URL: http://arxiv.org/abs/2011.09468v4
- Date: Wed, 24 Nov 2021 18:24:37 GMT
- Title: Gradient Starvation: A Learning Proclivity in Neural Networks
- Authors: Mohammad Pezeshki, Sékou-Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, Guillaume Lajoie
- Abstract summary: Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
- Score: 97.02382916372594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We identify and formalize a fundamental gradient descent phenomenon resulting
in a learning proclivity in over-parameterized neural networks. Gradient
Starvation arises when cross-entropy loss is minimized by capturing only a
subset of features relevant for the task, despite the presence of other
predictive features that fail to be discovered. This work provides a
theoretical explanation for the emergence of such feature imbalance in neural
networks. Using tools from Dynamical Systems theory, we identify simple
properties of learning dynamics during gradient descent that lead to this
imbalance, and prove that such a situation can be expected given certain
statistical structure in training data. Based on our proposed formalism, we
develop guarantees for a novel regularization method aimed at decoupling
feature learning dynamics, improving accuracy and robustness in cases hindered
by gradient starvation. We illustrate our findings with simple and real-world
out-of-distribution (OOD) generalization experiments.
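The regularization method developed in the paper is called Spectral Decoupling there: per the paper's formulation, it replaces standard weight decay with an L2 penalty on the network's logits, which decouples the learning dynamics of individual features. A minimal PyTorch-style sketch (the function name and the coefficient lam = 1e-2 are illustrative assumptions, not values from the paper):

    import torch
    import torch.nn.functional as F

    def spectral_decoupling_loss(logits, targets, lam=1e-2):
        # Standard cross-entropy term.
        ce = F.cross_entropy(logits, targets)
        # Penalize the squared logits rather than the weights; this is
        # the decoupling term intended to counteract gradient starvation.
        penalty = 0.5 * lam * (logits ** 2).mean()
        return ce + penalty

In training this penalty stands in for, rather than supplements, weight decay, e.g. loss = spectral_decoupling_loss(model(x), y).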
Related papers
- Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of deep learning's surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z)
- Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize [5.642322814965062]
Learning representations that generalize under distribution shifts is critical for building robust machine learning models.
We show that even allowing a neural network to explicitly fit the representations obtained from a teacher network that can generalize out-of-distribution is insufficient for the generalization of the student network.
arXiv Detail & Related papers (2024-06-05T15:04:27Z)
- Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse [19.279084204631204]
We extend the investigation of Neural Collapse to biased datasets with imbalanced attributes.
We propose an avoid-shortcut learning framework without additional training complexity.
With well-designed shortcut primes based on Neural Collapse structure, the models are encouraged to skip the pursuit of simple shortcuts.
arXiv Detail & Related papers (2024-05-09T07:23:37Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
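For context, the unhinged loss as commonly defined in the literature (van Rooyen et al., 2015) is linear in the margin, l(y, f(x)) = 1 - y f(x); its gradient is constant in the prediction, which is what makes closed-form training dynamics tractable.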
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation.
We show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Path classification by stochastic linear recurrent neural networks [2.5499055723658097]
We show that RNNs retain a partial signature of the paths they are fed as the sole information exploited for training and classification tasks.
We argue that these RNNs are easy to train and robust, and we back these observations with numerical experiments on both synthetic and real data.
arXiv Detail & Related papers (2021-08-06T12:59:12Z)
- Untangling tradeoffs between recurrence and self-attention in neural networks [81.30894993852813]
We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
arXiv Detail & Related papers (2020-06-16T19:24:25Z)