Generalization bounds via distillation
- URL: http://arxiv.org/abs/2104.05641v1
- Date: Mon, 12 Apr 2021 17:03:13 GMT
- Title: Generalization bounds via distillation
- Authors: Daniel Hsu and Ziwei Ji and Matus Telgarsky and Lan Wang
- Abstract summary: Given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds.
The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation.
To round out the story, a (looser) classical uniform convergence analysis of compression is also presented.
- Score: 45.42830829641181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper theoretically investigates the following empirical phenomenon:
given a high-complexity network with poor generalization bounds, one can
distill it into a network with nearly identical predictions but low complexity
and vastly smaller generalization bounds. The main contribution is an analysis
showing that the original network inherits this good generalization bound from
its distillation, assuming the use of well-behaved data augmentation. This
bound is presented both in an abstract and in a concrete form, the latter
complemented by a reduction technique to handle modern computation graphs
featuring convolutional layers, fully-connected layers, and skip connections,
to name a few. To round out the story, a (looser) classical uniform convergence
analysis of compression is also presented, as well as a variety of experiments
on cifar and mnist demonstrating similar generalization performance between the
original network and its distillation.
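For concrete intuition about the distillation step the abstract refers to, the sketch below shows generic teacher-student distillation in PyTorch: a low-complexity student is trained to match the softened predictions of a frozen high-complexity teacher. This is a minimal illustrative sketch, not the paper's actual training setup; the architectures, temperature T, optimizer, and data loader are assumptions.

```python
# Minimal, illustrative teacher-student distillation sketch (assumed setup,
# not the paper's experimental configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F


def distill(teacher: nn.Module, student: nn.Module, loader,
            epochs: int = 10, T: float = 4.0, lr: float = 1e-3) -> nn.Module:
    """Train `student` to reproduce the (temperature-softened) predictions of a frozen `teacher`."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:  # ground-truth labels are unused: the student fits the teacher's outputs
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            # KL divergence between softened student and teacher distributions;
            # the T*T factor keeps gradient magnitudes comparable across temperatures.
            loss = F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

Generalization bounds would then be computed for the much simpler student and, per the paper's main result, transferred back to the original network under well-behaved data augmentation.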
Related papers
- Generalization for Least Squares Regression With Simple Spiked Covariances [3.9134031118910264]
The generalization properties of even two-layer neural networks trained by gradient descent remain poorly understood.
Recent work has made progress by describing the spectrum of the feature matrix at the hidden layer.
Yet, the generalization error for linear models with spiked covariances has not been previously determined.
arXiv Detail & Related papers (2024-10-17T19:46:51Z)
- What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z)
- Asymptotic generalization error of a single-layer graph convolutional network [0.0]
We predict the performance of a single-layer graph convolutional network trained on data produced by attributed block models.
We study the high signal-to-noise ratio limit, detail the convergence rates of the GCN and show that, while consistent, it does not reach the Bayes-optimal rate for any of the considered cases.
arXiv Detail & Related papers (2024-02-06T09:07:26Z)
- Sparsity-aware generalization theory for deep neural networks [12.525959293825318]
We present a new approach to analyzing generalization for deep feed-forward ReLU networks.
We show fundamental trade-offs between sparsity and generalization.
arXiv Detail & Related papers (2023-07-01T20:59:05Z)
- Generalization analysis of an unfolding network for analysis-based Compressed Sensing [17.814125871206077]
Unfolding networks have shown promising results in the Compressed Sensing (CS) field.
We perform a generalization analysis of a state-of-the-art ADMM-based unfolding network.
Our proposed framework is consistent with our theoretical findings and outperforms the baseline.
arXiv Detail & Related papers (2023-03-09T21:13:32Z)
- A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity [71.11795737362459]
ViTs with self-attention modules have recently achieved great empirical success in many tasks.
However, theoretical analysis of their learning and generalization remains largely elusive.
This paper provides the first theoretical analysis of a shallow ViT for a classification task.
arXiv Detail & Related papers (2023-02-12T22:12:35Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks [26.58848653965855]
We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations.
We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally.
arXiv Detail & Related papers (2022-10-07T21:14:09Z)
- Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Each subfunction has its own activation pattern, domain, and empirical error.
The empirical error of the full network can be written as an expectation over subfunctions (see the sketch after this list).
arXiv Detail & Related papers (2021-06-15T18:34:41Z)
- Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relationship between a network's architecture and its generalizability from a compression perspective.
We propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)
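The sketch referenced in the "Predicting Unreliable Predictions by Shattering a Neural Network" entry above, with assumed notation (the region set P is not taken from that paper): if a piecewise-linear network f partitions the input space into activation regions, one subfunction per region, then the empirical error over n samples decomposes exactly as

$$
\widehat{\operatorname{err}}(f)
= \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[f(x_i)\neq y_i\right]
= \sum_{A\in\mathcal{P}} \frac{\lvert\{i : x_i\in A\}\rvert}{n}\;\widehat{\operatorname{err}}\!\left(f\mid A\right),
$$

i.e., an expectation of the subfunctions' empirical errors under the empirical distribution over regions.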
This list is automatically generated from the titles and abstracts of the papers in this site.