Explaining generalization in deep learning: progress and fundamental
limits
- URL: http://arxiv.org/abs/2110.08922v1
- Date: Sun, 17 Oct 2021 21:17:30 GMT
- Title: Explaining generalization in deep learning: progress and fundamental
limits
- Authors: Vaishnavh Nagarajan
- Abstract summary: In the first part of the thesis, we will empirically study how training deep networks via gradient descent implicitly controls the networks' capacity.
We will then derive data-dependent, uniform-convergence-based generalization bounds with improved dependencies on the parameter count.
In the last part of the thesis, we will introduce an empirical technique to estimate generalization using unlabeled data.
- Score: 8.299945169799795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This dissertation studies a fundamental open challenge in deep learning
theory: why do deep networks generalize well even while being
overparameterized, unregularized and fitting the training data to zero error?
In the first part of the thesis, we will empirically study how training deep
networks via stochastic gradient descent implicitly controls the networks'
capacity. Subsequently, to show how this leads to better generalization, we
will derive data-dependent, uniform-convergence-based generalization
bounds with improved dependencies on the parameter count.
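As a rough template (a generic uniform-convergence-style bound, not the thesis's specific result), such bounds relate the population loss $L_{\mathcal{D}}(h_S)$ of the learned network $h_S$ to its training loss $\widehat{L}_S(h_S)$ plus a complexity term:
$$ L_{\mathcal{D}}(h_S) \;\le\; \widehat{L}_S(h_S) \;+\; \mathcal{O}\!\left(\sqrt{\frac{\mathcal{C}(\mathcal{H})}{m}}\right), $$
where $m$ is the training-set size and $\mathcal{C}(\mathcal{H})$ is a complexity measure of the hypothesis class (classically one that grows with the parameter count). Data-dependent bounds shrink the effective class $\mathcal{H}$ to hypotheses the algorithm actually reaches on data from the distribution, which is how improved dependence on the parameter count becomes possible.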
Uniform convergence has in fact been the most widely used tool in deep
learning literature, thanks to its simplicity and generality. Given its
popularity, in this thesis, we will also take a step back to identify the
fundamental limits of uniform convergence as a tool to explain generalization.
In particular, we will show that in some example overparameterized settings,
any uniform convergence bound will provide only a vacuous generalization
bound.
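Schematically, in the notation above (an illustrative rendering of vacuity, not a precise statement of the thesis's theorems), these settings satisfy
$$ \sup_{h \in \mathcal{H}_{\mathrm{alg}}} \big| L_{\mathcal{D}}(h) - \widehat{L}_S(h) \big| \;\approx\; 1 \qquad \text{even though} \qquad L_{\mathcal{D}}(h_S) - \widehat{L}_S(h_S) \;\approx\; 0, $$
where $\mathcal{H}_{\mathrm{alg}}$ is any class containing the hypotheses the algorithm outputs on typical training sets; even the tightest algorithm-dependent uniform convergence bound then stays near the trivial value of 1.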
With this realization in mind, in the last part of the thesis, we will change
course and introduce an empirical technique to estimate generalization
using unlabeled data. Our technique does not rely on any notion of
uniform-convergence-based complexity and is remarkably precise. We will
theoretically show why our technique enjoys such precision.
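One concrete instantiation of this idea is a disagreement-style estimator: train two networks with independent SGD runs and use the rate at which they disagree on unlabeled data as a proxy for test error. The sketch below is illustrative only, not claimed to be the exact procedure developed in the thesis; the model objects in the usage comments are hypothetical.

```python
# Illustrative sketch: disagreement-based estimate of test error from
# unlabeled data. Not claimed to be the thesis's exact procedure.
import numpy as np

def disagreement_rate(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Fraction of unlabeled points on which two classifiers disagree."""
    if preds_a.shape != preds_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(preds_a != preds_b))

# Usage (hypothetical model objects): train model_a and model_b with
# independent SGD runs (different seeds / data orders), predict labels on a
# pool of unlabeled inputs, and read off the disagreement rate as an
# estimate of the test error.
# preds_a = model_a.predict(unlabeled_x)
# preds_b = model_b.predict(unlabeled_x)
# estimated_test_error = disagreement_rate(preds_a, preds_b)
```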
We will conclude by discussing how future work could explore novel ways to
incorporate distributional assumptions in generalization bounds (such as in the
form of unlabeled data) and explore other tools to derive bounds, perhaps by
modifying uniform convergence or by developing completely new tools altogether.
Related papers
- Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity [84.12126298229866]
We show that zero-shot generalization during instruction tuning happens very early.
We also show that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization.
For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level.
arXiv Detail & Related papers (2024-06-17T16:40:21Z)
- Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation [59.138470433237615]
We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning.
We show that systematically controlled metrics are strongly predictive of generalization performance.
This work informs an important direction: enhancing the diversity or balance of the data rather than simply scaling up its absolute size.
arXiv Detail & Related papers (2024-03-25T03:18:39Z)
- Fantastic Generalization Measures are Nowhere to be Found [14.599761709255917]
We study the notion of a generalization bound being uniformly tight, meaning that the difference between the bound and the population loss is small.
Numerous generalization bounds have been proposed in the literature as potential explanations for the ability of neural networks to generalize.
arXiv Detail & Related papers (2023-09-24T14:53:51Z)
- Generalization Analysis for Contrastive Representation Learning [80.89690821916653]
Existing generalization error bounds depend linearly on the number $k$ of negative examples.
We establish novel generalization bounds for contrastive learning which do not depend on $k$, up to logarithmic terms.
arXiv Detail & Related papers (2023-02-24T01:03:56Z)
- Generalization Bounds with Data-dependent Fractal Dimensions [5.833272638548154]
We prove fractal geometry-based generalization bounds without requiring any Lipschitz assumption.
Despite introducing significant technical complications, this new notion lets us control the generalization error.
arXiv Detail & Related papers (2023-02-06T13:24:48Z)
- PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization [48.26492774959634]
We develop a compression approach based on quantizing neural network parameters in a linear subspace.
We find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor.
arXiv Detail & Related papers (2022-11-24T13:50:16Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Learning Non-Vacuous Generalization Bounds from Optimization [8.294831479902658]
We present a simple yet non-vacuous generalization bound from the optimization perspective.
We achieve this goal by leveraging that the hypothesis set accessed by gradient algorithms is essentially fractal-like.
Numerical studies demonstrate that our approach is able to yield plausible generalization guarantees for modern neural networks.
arXiv Detail & Related papers (2022-06-09T08:59:46Z)
- An Online Learning Approach to Interpolation and Extrapolation in Domain Generalization [53.592597682854944]
We recast generalization over sub-groups as an online game between a player minimizing risk and an adversary presenting new test distributions.
We show that ERM is provably minimax-optimal for both tasks.
arXiv Detail & Related papers (2021-02-25T19:06:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.