On the Generalization Mystery in Deep Learning
- URL: http://arxiv.org/abs/2203.10036v1
- Date: Fri, 18 Mar 2022 16:09:53 GMT
- Title: On the Generalization Mystery in Deep Learning
- Authors: Satrajit Chatterjee and Piotr Zielinski
- Abstract summary: We argue that the answer to both questions (why over-parameterized networks generalize well on real data even though they can fit random data, and how gradient descent finds a well-generalizing solution) lies in the interaction of the gradients of different examples during training.
We formalize this argument with an easy-to-compute and interpretable metric for coherence.
The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others.
- Score: 15.2292571922932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The generalization mystery in deep learning is the following: Why do
over-parameterized neural networks trained with gradient descent (GD)
generalize well on real datasets even though they are capable of fitting random
datasets of comparable size? Furthermore, from among all solutions that fit the
training data, how does GD find one that generalizes well (when such a
well-generalizing solution exists)?
We argue that the answer to both questions lies in the interaction of the
gradients of different examples during training. Intuitively, if the
per-example gradients are well-aligned, that is, if they are coherent, then one
may expect GD to be (algorithmically) stable, and hence generalize well. We
formalize this argument with an easy-to-compute and interpretable metric for
coherence, and show that the metric takes on very different values on real and
random datasets for several common vision networks. The theory also explains a
number of other phenomena in deep learning, such as why some examples are
reliably learned earlier than others, why early stopping works, and why it is
possible to learn from noisy labels. Moreover, since the theory provides a
causal explanation of how GD finds a well-generalizing solution when one
exists, it motivates a class of simple modifications to GD that attenuate
memorization and improve generalization.
Generalization in deep learning is an extremely broad phenomenon, and
therefore, it requires an equally general explanation. We conclude with a
survey of alternative lines of attack on this problem, and argue that the
proposed approach is the most viable one on this basis.
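To make the coherence idea concrete, here is a minimal sketch that computes per-example gradients for a toy network and scores their alignment with an average pairwise cosine similarity. The toy model, loss, batch, and the use of cosine similarity as the alignment score are assumptions made for illustration; this is not the exact coherence metric defined in the paper.

```python
# Minimal sketch: measuring how well per-example gradients align.
# Assumes a toy PyTorch model; average pairwise cosine similarity is used
# as a stand-in alignment score (an illustration, not the paper's metric).
import torch
import torch.nn as nn

def per_example_gradients(model, loss_fn, xs, ys):
    """Return an (m, p) tensor with one flattened gradient per example."""
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads.append(torch.cat([p.grad.reshape(-1) for p in model.parameters()]))
    return torch.stack(grads)

def pairwise_alignment(grads):
    """Average cosine similarity over all pairs of distinct examples."""
    normed = grads / (grads.norm(dim=1, keepdim=True) + 1e-12)
    sims = normed @ normed.T                      # (m, m) cosine similarities
    m = sims.shape[0]
    return (sims.sum() - sims.diagonal().sum()) / (m * (m - 1))

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    loss_fn = nn.CrossEntropyLoss()
    xs = torch.randn(16, 10)
    ys_real = (xs[:, 0] > 0).long()               # labels tied to the inputs
    ys_rand = torch.randint(0, 2, (16,))          # random labels
    print("alignment, real labels:  ",
          pairwise_alignment(per_example_gradients(model, loss_fn, xs, ys_real)).item())
    print("alignment, random labels:",
          pairwise_alignment(per_example_gradients(model, loss_fn, xs, ys_rand)).item())
```

The paper's claim that the metric takes on very different values on real and random datasets refers to measurements on common vision networks during training; the toy comparison above only demonstrates the mechanics of the measurement.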
Related papers
- Generalization of Graph Neural Networks is Robust to Model Mismatch [84.01980526069075]
Graph neural networks (GNNs) have demonstrated their effectiveness in various tasks supported by their generalization capabilities.
In this paper, we examine GNNs that operate on geometric graphs generated from manifold models.
Our analysis reveals the robustness of the GNN generalization in the presence of such model mismatch.
arXiv Detail & Related papers (2024-08-25T16:00:44Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
arXiv Detail & Related papers (2022-10-26T21:03:46Z) - Towards understanding how momentum improves generalization in deep
learning [44.441873298005326]
We show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems.
A key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin.
arXiv Detail & Related papers (2022-07-13T02:39:08Z) - Learning Non-Vacuous Generalization Bounds from Optimization [8.294831479902658]
We present a simple yet non-vacuous generalization bound from the optimization perspective.
We achieve this goal by leveraging the fact that the hypothesis set accessed by gradient algorithms is essentially fractal-like.
Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks.
arXiv Detail & Related papers (2022-06-09T08:59:46Z) - Explaining generalization in deep learning: progress and fundamental
limits [8.299945169799795]
In the first part of the thesis, we will empirically study how training deep networks via gradient descent implicitly controls the networks' capacity.
We will then derive data-dependent, uniform-convergence-based generalization bounds with improved dependencies on the parameter count.
In the last part of the thesis, we will introduce an empirical technique to estimate generalization using unlabeled data.
arXiv Detail & Related papers (2021-10-17T21:17:30Z) - Parameterized Explainer for Graph Neural Network [49.79917262156429]
We propose PGExplainer, a parameterized explainer for Graph Neural Networks (GNNs).
Compared to the existing work, PGExplainer has better generalization ability and can be utilized in an inductive setting easily.
Experiments on both synthetic and real-life datasets show highly competitive performance with up to 24.7% relative improvement in AUC on explaining graph classification.
arXiv Detail & Related papers (2020-11-09T17:15:03Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Making Coherence Out of Nothing At All: Measuring the Evolution of
Gradient Alignment [15.2292571922932]
We propose a new metric ($m$-coherence) to experimentally study the alignment of per-example gradients during training.
We show that $m$-coherence is more interpretable, cheaper to compute ($O(m)$ instead of $O(m^2)$), and mathematically cleaner; a short sketch of the identity behind this cost reduction appears after this list.
arXiv Detail & Related papers (2020-08-03T21:51:24Z) - Optimization and Generalization Analysis of Transduction through
Gradient Boosting and Application to Multi-scale Graph Neural Networks [60.22494363676747]
It is known that current graph neural networks (GNNs) are difficult to make deep due to the problem known as over-smoothing.
Multi-scale GNNs are a promising approach for mitigating the over-smoothing problem.
We derive the optimization and generalization guarantees of transductive learning algorithms that include multi-scale GNNs.
arXiv Detail & Related papers (2020-06-15T17:06:17Z) - Coherent Gradients: An Approach to Understanding Generalization in
Gradient Descent-based Optimization [15.2292571922932]
We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent.
We show that changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples.
arXiv Detail & Related papers (2020-02-25T03:59:31Z)
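On the $m$-coherence entry above (the sketch promised there): the $O(m)$ versus $O(m^2)$ remark rests on a standard algebraic identity, namely that the sum of dot products over all pairs of distinct per-example gradients equals the squared norm of their sum minus the sum of their squared norms, so an un-normalized alignment score can be computed without forming the full $m \times m$ similarity matrix. The NumPy snippet below illustrates only this identity; it is a stand-in for intuition, not the $m$-coherence metric itself.

```python
# Illustration of the identity behind cheap gradient-alignment scores:
#   sum over i != j of  g_i . g_j  =  ||sum_i g_i||^2 - sum_i ||g_i||^2
# This shows the O(m) vs O(m^2) trade-off only; it is not m-coherence itself.
import numpy as np

def pairwise_dot_sum_naive(grads):
    """O(m^2) dot products: sum g_i . g_j over all ordered pairs i != j."""
    m = grads.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += grads[i] @ grads[j]
    return total

def pairwise_dot_sum_fast(grads):
    """O(m) dot products: use the squared-norm-of-the-sum identity."""
    g_sum = grads.sum(axis=0)
    return g_sum @ g_sum - np.sum(grads * grads)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = rng.normal(size=(8, 1000))   # 8 stand-in per-example gradients
    print(pairwise_dot_sum_naive(grads))
    print(pairwise_dot_sum_fast(grads))  # agrees up to floating-point error
```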
This list is automatically generated from the titles and abstracts of the papers on this site.