On the Generalization Mystery in Deep Learning
- URL: http://arxiv.org/abs/2203.10036v1
- Date: Fri, 18 Mar 2022 16:09:53 GMT
- Title: On the Generalization Mystery in Deep Learning
- Authors: Satrajit Chatterjee and Piotr Zielinski
- Abstract summary: We argue that the answer to both questions (why over-parameterized networks generalize well on real data even though they can fit random data, and how gradient descent finds a well-generalizing solution) lies in the interaction of the gradients of different examples during training.
We formalize this argument with an easy-to-compute and interpretable metric for coherence.
The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others.
- Score: 15.2292571922932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The generalization mystery in deep learning is the following: Why do
over-parameterized neural networks trained with gradient descent (GD)
generalize well on real datasets even though they are capable of fitting random
datasets of comparable size? Furthermore, from among all solutions that fit the
training data, how does GD find one that generalizes well (when such a
well-generalizing solution exists)?
We argue that the answer to both questions lies in the interaction of the
gradients of different examples during training. Intuitively, if the
per-example gradients are well-aligned, that is, if they are coherent, then one
may expect GD to be (algorithmically) stable, and hence generalize well. We
formalize this argument with an easy-to-compute and interpretable metric for
coherence, and show that the metric takes on very different values on real and
random datasets for several common vision networks. The theory also explains a
number of other phenomena in deep learning, such as why some examples are
reliably learned earlier than others, why early stopping works, and why it is
possible to learn from noisy labels. Moreover, since the theory provides a
causal explanation of how GD finds a well-generalizing solution when one
exists, it motivates a class of simple modifications to GD that attenuate
memorization and improve generalization.
Generalization in deep learning is an extremely broad phenomenon, and
therefore, it requires an equally general explanation. We conclude with a
survey of alternative lines of attack on this problem, and argue that the
proposed approach is the most viable one on this basis.
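To make the coherence idea concrete, here is a minimal sketch that computes per-example gradients for a toy network and scores their alignment with an average pairwise cosine similarity. The toy model, loss, batch, and the use of cosine similarity as the alignment score are assumptions made for illustration; this is not the exact coherence metric defined in the paper.

```python
# Minimal sketch: measuring how well per-example gradients align.
# Assumes a toy PyTorch model; average pairwise cosine similarity is used
# as a stand-in alignment score (an illustration, not the paper's metric).
import torch
import torch.nn as nn

def per_example_gradients(model, loss_fn, xs, ys):
    """Return an (m, p) tensor with one flattened gradient per example."""
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads.append(torch.cat([p.grad.reshape(-1) for p in model.parameters()]))
    return torch.stack(grads)

def pairwise_alignment(grads):
    """Average cosine similarity over all pairs of distinct examples."""
    normed = grads / (grads.norm(dim=1, keepdim=True) + 1e-12)
    sims = normed @ normed.T                      # (m, m) cosine similarities
    m = sims.shape[0]
    return (sims.sum() - sims.diagonal().sum()) / (m * (m - 1))

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    loss_fn = nn.CrossEntropyLoss()
    xs = torch.randn(16, 10)
    ys_real = (xs[:, 0] > 0).long()               # labels tied to the inputs
    ys_rand = torch.randint(0, 2, (16,))          # random labels
    print("alignment, real labels:  ",
          pairwise_alignment(per_example_gradients(model, loss_fn, xs, ys_real)).item())
    print("alignment, random labels:",
          pairwise_alignment(per_example_gradients(model, loss_fn, xs, ys_rand)).item())
```

The paper's claim that the metric takes on very different values on real and random datasets refers to measurements on common vision networks during training; the toy comparison above only demonstrates the mechanics of the measurement.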
Related papers
- Generalization of Graph Neural Networks is Robust to Model Mismatch [84.01980526069075]
Graph neural networks (GNNs) have demonstrated their effectiveness in various tasks supported by their generalization capabilities.
In this paper, we examine GNNs that operate on geometric graphs generated from manifold models.
Our analysis reveals the robustness of the GNN generalization in the presence of such model mismatch.
arXiv Detail & Related papers (2024-08-25T16:00:44Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
arXiv Detail & Related papers (2022-10-26T21:03:46Z) - Towards understanding how momentum improves generalization in deep
learning [44.441873298005326]
We show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems.
A key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin.
arXiv Detail & Related papers (2022-07-13T02:39:08Z) - Learning Non-Vacuous Generalization Bounds from Optimization [8.294831479902658]
We present a simple yet non-vacuous generalization bound from the optimization perspective.
We achieve this goal by leveraging the fact that the hypothesis set accessed by gradient algorithms is essentially fractal-like.
Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks.
arXiv Detail & Related papers (2022-06-09T08:59:46Z) - Explaining generalization in deep learning: progress and fundamental
limits [8.299945169799795]
In the first part of the thesis, we will empirically study how training deep networks via gradient descent implicitly controls the networks' capacity.
We will then derive data-dependent, uniform-convergence-based generalization bounds with improved dependencies on the parameter count.
In the last part of the thesis, we will introduce an empirical technique to estimate generalization using unlabeled data.
arXiv Detail & Related papers (2021-10-17T21:17:30Z) - Parameterized Explainer for Graph Neural Network [49.79917262156429]
We propose PGExplainer, a parameterized explainer for Graph Neural Networks (GNNs).
Compared to the existing work, PGExplainer has better generalization ability and can be utilized in an inductive setting easily.
Experiments on both synthetic and real-life datasets show highly competitive performance with up to 24.7% relative improvement in AUC on explaining graph classification.
arXiv Detail & Related papers (2020-11-09T17:15:03Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Making Coherence Out of Nothing At All: Measuring the Evolution of
Gradient Alignment [15.2292571922932]
We propose a new metric ($m$-coherence) to experimentally study the alignment of per-example gradients during training.
We show that $m$-coherence is more interpretable, cheaper to compute ($O(m)$ instead of $O(m^2)$), and mathematically cleaner; a short sketch of the identity behind this cost reduction appears after this list.
arXiv Detail & Related papers (2020-08-03T21:51:24Z) - Optimization and Generalization Analysis of Transduction through
Gradient Boosting and Application to Multi-scale Graph Neural Networks [60.22494363676747]
It is known that current graph neural networks (GNNs) are difficult to make deep due to the problem known as over-smoothing.
Multi-scale GNNs are a promising approach for mitigating the over-smoothing problem.
We derive the optimization and generalization guarantees of transductive learning algorithms that include multi-scale GNNs.
arXiv Detail & Related papers (2020-06-15T17:06:17Z) - Coherent Gradients: An Approach to Understanding Generalization in
Gradient Descent-based Optimization [15.2292571922932]
We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent.
We show that changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples.
arXiv Detail & Related papers (2020-02-25T03:59:31Z)
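On the $m$-coherence entry above (the sketch promised there): the $O(m)$ versus $O(m^2)$ remark rests on a standard algebraic identity, namely that the sum of dot products over all pairs of distinct per-example gradients equals the squared norm of their sum minus the sum of their squared norms, so an un-normalized alignment score can be computed without forming the full $m \times m$ similarity matrix. The NumPy snippet below illustrates only this identity; it is a stand-in for intuition, not the $m$-coherence metric itself.

```python
# Illustration of the identity behind cheap gradient-alignment scores:
#   sum over i != j of  g_i . g_j  =  ||sum_i g_i||^2 - sum_i ||g_i||^2
# This shows the O(m) vs O(m^2) trade-off only; it is not m-coherence itself.
import numpy as np

def pairwise_dot_sum_naive(grads):
    """O(m^2) dot products: sum g_i . g_j over all ordered pairs i != j."""
    m = grads.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += grads[i] @ grads[j]
    return total

def pairwise_dot_sum_fast(grads):
    """O(m) dot products: use the squared-norm-of-the-sum identity."""
    g_sum = grads.sum(axis=0)
    return g_sum @ g_sum - np.sum(grads * grads)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = rng.normal(size=(8, 1000))   # 8 stand-in per-example gradients
    print(pairwise_dot_sum_naive(grads))
    print(pairwise_dot_sum_fast(grads))  # agrees up to floating-point error
```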
This list is automatically generated from the titles and abstracts of the papers on this site.