Coherent Gradients: An Approach to Understanding Generalization in
Gradient Descent-based Optimization
- URL: http://arxiv.org/abs/2002.10657v1
- Date: Tue, 25 Feb 2020 03:59:31 GMT
- Title: Coherent Gradients: An Approach to Understanding Generalization in
Gradient Descent-based Optimization
- Authors: Satrajit Chatterjee
- Abstract summary: We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent.
We show that changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples.
- Score: 15.2292571922932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An open question in the Deep Learning community is why neural networks
trained with Gradient Descent generalize well on real datasets even though they
are capable of fitting random data. We propose an approach to answering this
question based on a hypothesis about the dynamics of gradient descent that we
call Coherent Gradients: Gradients from similar examples are similar and so the
overall gradient is stronger in certain directions where these reinforce each
other. Thus changes to the network parameters during training are biased
towards those that (locally) simultaneously benefit many examples when such
similarity exists. We support this hypothesis with heuristic arguments and
perturbative experiments and outline how this can explain several common
empirical observations about Deep Learning. Furthermore, our analysis is not
just descriptive, but prescriptive. It suggests a natural modification to
gradient descent that can greatly reduce overfitting.
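For concreteness, here is a minimal sketch (not the paper's code) of the two ideas the abstract points to: a rough measure of how strongly per-example gradients reinforce each other, and a winsorized-style update that clips the most extreme per-example gradient values in each coordinate before averaging, so that parameter changes are driven by directions on which many examples agree. The linear-regression model, the coherence ratio, and the clipping level c are illustrative assumptions rather than the paper's exact procedure.
```python
# Minimal sketch (illustrative, not the paper's implementation):
# (1) measure how much per-example gradients reinforce each other, and
# (2) a winsorized update that clips extreme per-example gradient values
#     per coordinate before averaging.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: linear regression with squared loss (an assumption for illustration).
n, d = 32, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def per_example_grads(w):
    """Gradient of 0.5 * (x.w - y)^2 for each example; shape (n, d)."""
    residuals = X @ w - y               # (n,)
    return residuals[:, None] * X       # (n, d)

def coherence(grads):
    """||mean gradient||^2 relative to the mean squared per-example gradient norm.
    Near 1 when gradients reinforce each other, near 1/n when roughly uncorrelated."""
    g_mean = grads.mean(axis=0)
    return float(g_mean @ g_mean / np.mean(np.sum(grads ** 2, axis=1)))

def winsorized_step(w, lr=0.05, c=2):
    """Clip the c most extreme per-example values on each side of every
    coordinate before averaging, so the update favors directions on which
    many examples agree."""
    g = per_example_grads(w)
    g_sorted = np.sort(g, axis=0)
    lo, hi = g_sorted[c], g_sorted[-(c + 1)]   # per-coordinate clip bounds
    return w - lr * np.clip(g, lo, hi).mean(axis=0)

w = np.zeros(d)
for step in range(100):
    if step % 25 == 0:
        print(f"step {step:3d}  coherence = {coherence(per_example_grads(w)):.3f}")
    w = winsorized_step(w)
```
A coherence value near 1 means the per-example gradients largely point the same way; values near 1/n are what roughly uncorrelated gradients would give. Larger c damps idiosyncratic (potentially memorization-driven) components more aggressively, at the cost of a more biased average gradient.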
Related papers
- Parallel Momentum Methods Under Biased Gradient Estimations [11.074080383657453]
Parallel gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes.
However, obtaining unbiased gradient estimates, which have been the focus of most theoretical research, is challenging in many machine learning applications.
In this paper we work out the implications for settings where gradient estimates are biased, e.g., in meta-learning and when gradients are compressed or clipped.
arXiv Detail & Related papers (2024-02-29T18:03:03Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
- On the Overlooked Structure of Stochastic Gradients [34.650998241703626]
We show that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and gradient noise caused by minibatch training usually do not exhibit power-law heavy tails.
Our work challenges the existing belief and provides novel insights on the structure of gradients in deep learning.
arXiv Detail & Related papers (2022-12-05T07:55:22Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- The Manifold Hypothesis for Gradient-Based Explanations [55.01671263121624]
We investigate when gradient-based explanation algorithms provide perceptually-aligned explanations.
We show that the more a feature attribution is aligned with the tangent space of the data, the more perceptually-aligned it tends to be.
We suggest that explanation algorithms should actively strive to align their explanations with the data manifold.
arXiv Detail & Related papers (2022-06-15T08:49:24Z)
- Depth Without the Magic: Inductive Bias of Natural Gradient Descent [1.020554144865699]
In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories.
We characterize the behaviour of natural gradient flow in deep linear networks for separable classification under logistic loss and deep matrix factorization.
We demonstrate that there exist learning problems where natural gradient descent fails to generalize, while gradient descent with the right architecture performs well.
arXiv Detail & Related papers (2021-11-22T21:20:10Z)
- Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to a global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z)
- Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
- Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit Bias towards Low Rank [1.9350867959464846]
In deep learning, gradient descent tends to prefer solutions which generalize well.
In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem.
arXiv Detail & Related papers (2020-11-27T15:08:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.