Making Coherence Out of Nothing At All: Measuring the Evolution of
Gradient Alignment
- URL: http://arxiv.org/abs/2008.01217v1
- Date: Mon, 3 Aug 2020 21:51:24 GMT
- Title: Making Coherence Out of Nothing At All: Measuring the Evolution of
Gradient Alignment
- Authors: Satrajit Chatterjee, Piotr Zielinski
- Abstract summary: We propose a new metric ($m$-coherence) to experimentally study the alignment of per-example gradients during training.
We show that $m$-coherence is more interpretable, cheaper to compute ($O(m)$ instead of $O(m^2)$) and mathematically cleaner.
- Score: 15.2292571922932
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a new metric ($m$-coherence) to experimentally study the alignment
of per-example gradients during training. Intuitively, given a sample of size
$m$, $m$-coherence is the number of examples in the sample that benefit from a
small step along the gradient of any one example on average. We show that
compared to other commonly used metrics, $m$-coherence is more interpretable,
cheaper to compute ($O(m)$ instead of $O(m^2)$) and mathematically cleaner. (We
note that $m$-coherence is closely connected to gradient diversity, a quantity
previously used in some theoretical bounds.) Using $m$-coherence, we study the
evolution of alignment of per-example gradients in ResNet and Inception models
on ImageNet and several variants with label noise, particularly from the
perspective of the recently proposed Coherent Gradients (CG) theory that
provides a simple, unified explanation for memorization and generalization
[Chatterjee, ICLR 20]. Although we have several interesting takeaways, our most
surprising result concerns memorization. Naively, one might expect that when
training with completely random labels, each example is fitted independently,
and so $m$-coherence should be close to 1. However, this is not the case:
$m$-coherence reaches much higher values during training (100s), indicating
that over-parameterized neural networks find common patterns even in scenarios
where generalization is not possible. A detailed analysis of this phenomenon
provides a deeper confirmation of CG, but at the same time puts into sharp
relief what is missing from the theory in order to provide a complete
explanation of generalization in neural networks.
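The abstract gives only the intuition behind $m$-coherence, not an exact formula. The sketch below assumes one natural formalization consistent with that description (and with the gradient-diversity connection noted above): the ratio $\|\sum_i g_i\|^2 / \sum_i \|g_i\|^2$ over per-example gradients $g_1, \dots, g_m$, which costs $O(m)$ to evaluate, is roughly 1 when gradients are unrelated, and approaches $m$ when they are fully aligned. It is an illustration under that assumption, not the paper's reference implementation.
```python
import numpy as np

def m_coherence(per_example_grads):
    """Illustrative m-coherence for an (m, d) array of per-example gradients.

    Assumes the formalization ||sum_i g_i||^2 / sum_i ||g_i||^2: roughly 1 for
    unrelated (near-orthogonal) gradients, m for identical gradients.  Needs
    only O(m) norms and one sum, not O(m^2) pairwise dot products.
    """
    g = np.asarray(per_example_grads, dtype=np.float64)  # shape (m, d)
    total = g.sum(axis=0)                                # summed gradient
    return float(total @ total) / float((g * g).sum())   # ||sum||^2 / sum ||g_i||^2

# Toy check: aligned gradients give coherence near m, random ones near 1.
rng = np.random.default_rng(0)
aligned = np.tile(rng.normal(size=512), (100, 1))
unrelated = rng.normal(size=(100, 512))
print(m_coherence(aligned))    # ~100
print(m_coherence(unrelated))  # ~1
```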
Related papers
- Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove training loss minimization with $m = \Omega(\log^4(n))$ neurons and $T \approx n$ iterations.
With $m = \Omega(\log^4(n))$ neurons and $T \approx n$, we bound the test loss by $\tilde{O}(1/\cdot)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
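A minimal sketch of the SSFT bookkeeping described above, assuming a per-epoch boolean record of whether each first-split example is still classified correctly while training continues on the second split; the array layout and function name are hypothetical, not the authors' code.
```python
import numpy as np

def second_split_forgetting_time(correct):
    """correct[t, i] == True if first-split example i is still classified
    correctly after epoch t of training on the second split.  Returns, per
    example, the first epoch after which it is never correct again, or np.inf
    if it is never forgotten.  Sketch of the bookkeeping only.
    """
    correct = np.asarray(correct, dtype=bool)
    num_epochs, num_examples = correct.shape
    ssft = np.full(num_examples, np.inf)
    for i in range(num_examples):
        hits = np.flatnonzero(correct[:, i])        # epochs where still correct
        if hits.size and hits[-1] < num_epochs - 1:
            ssft[i] = hits[-1] + 1                   # forgotten right after this
    return ssft

# Mislabeled examples tend to have small SSFT, rare-but-clean ones larger,
# and examples consistent with the second split are never forgotten.
demo = np.array([[1, 1, 1],    # epoch 0
                 [0, 1, 1],    # epoch 1: example 0 forgotten
                 [0, 0, 1]])   # epoch 2: example 1 forgotten
print(second_split_forgetting_time(demo))  # [1. 2. inf]
```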
arXiv Detail & Related papers (2022-10-26T21:03:46Z) - Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions.
We generalize the SGD method so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of the optimal bound in the smooth nonconvex regime.
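To illustrate what "virtually any sampling mechanism" can mean, here is a generic sketch of SGD whose gradient estimator stays unbiased under an arbitrary sampling distribution via importance weighting; this is a standard textbook construction for illustration, not the paper's variance-reduced method or its analysis.
```python
import numpy as np

def sgd_arbitrary_sampling(grad_i, x0, probs, steps=20000, lr=0.005, seed=0):
    """Minimize (1/n) * sum_i f_i with SGD under an arbitrary sampling law.

    grad_i(i, x) returns the gradient of f_i at x.  Sampling index i with
    probability probs[i] and reweighting by 1 / (n * probs[i]) keeps the
    stochastic gradient unbiased.  Generic illustration only.
    """
    rng = np.random.default_rng(seed)
    n = len(probs)
    x = np.array(x0, dtype=np.float64)
    for _ in range(steps):
        i = rng.choice(n, p=probs)
        x -= lr * grad_i(i, x) / (n * probs[i])  # importance-weighted step
    return x

# Toy problem: f_i(x) = 0.5 * (x - a_i)^2, so the average is minimized at mean(a).
a = np.array([1.0, 2.0, 3.0, 10.0])
probs = a / a.sum()                              # a non-uniform sampling mechanism
x_hat = sgd_arbitrary_sampling(lambda i, x: x - a[i], 0.0, probs)
print(x_hat, a.mean())                           # both near 4.0 (up to SGD noise)
```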
arXiv Detail & Related papers (2022-06-05T21:32:33Z) - Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol{\mu} \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol{\mu})$.
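A hedged sketch of the importance-based sampling idea: spend more oracle queries on coordinates to which $f$ is more sensitive at the current estimate of $\boldsymbol{\mu}$. The allocation rule and the names below are simplifications for illustration, not the paper's algorithm or its $(\epsilon, \delta)$ guarantee.
```python
import numpy as np

def adaptive_estimate(f, grad_f, noisy_oracle, n, rounds=50, batch=20, seed=0):
    """Estimate f(mu) when each coordinate mu_i is only observable through a
    noisy oracle.  Queries are allocated roughly in proportion to the current
    sensitivity |df/dmu_i|.  Simplified sketch of the importance idea only.
    """
    rng = np.random.default_rng(seed)
    sums = np.zeros(n)
    counts = np.zeros(n, dtype=int)

    for i in range(n):                      # warm-up so every mean is defined
        for _ in range(3):
            sums[i] += noisy_oracle(i, rng)
            counts[i] += 1

    for _ in range(rounds):
        mu_hat = sums / counts
        weights = np.abs(grad_f(mu_hat)) + 1e-12      # coordinate importance
        picks = rng.choice(n, size=batch, p=weights / weights.sum())
        for i in picks:
            sums[i] += noisy_oracle(i, rng)
            counts[i] += 1

    return f(sums / counts)

# Toy example: f(mu) = 3*mu_0 + 0.1*mu_1, so coordinate 0 deserves most samples.
mu_true = np.array([1.0, 5.0])
f = lambda m: 3.0 * m[0] + 0.1 * m[1]
grad_f = lambda m: np.array([3.0, 0.1])
oracle = lambda i, rng: mu_true[i] + rng.normal(scale=1.0)
print(adaptive_estimate(f, grad_f, oracle, n=2), f(mu_true))  # both near 3.5
```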
arXiv Detail & Related papers (2022-03-18T18:50:52Z) - On the Generalization Mystery in Deep Learning [15.2292571922932]
We argue that the answer to the generalization mystery lies in the interaction of the gradients of different examples during training.
We formalize this argument with an easy-to-compute and interpretable metric for coherence.
The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others.
arXiv Detail & Related papers (2022-03-18T16:09:53Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Why Are Convolutional Nets More Sample-Efficient than Fully-Connected
Nets? [33.51250867983687]
We exhibit a natural task on which a provable sample complexity gap can be shown for standard training algorithms.
We demonstrate a single target function, learning which on all possible distributions leads to an $O(1)$ vs. $\Omega(d^2/\varepsilon)$ gap.
Similar results are achieved for $\ell$ regression and adaptive training algorithms, e.g. Adam and AdaGrad.
arXiv Detail & Related papers (2020-10-16T17:15:39Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study memorization and generalization phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z)
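To make the approximation chain in the entry above concrete, here is a small kernel ridge regression sketch in which the explicit ridge $\lambda$ is augmented by a "self-induced" term $\lambda_{\text{self}}$ standing in for the high-degree part of the activation; the RBF kernel and the chosen values are arbitrary placeholders, not the quantities derived in the paper.
```python
import numpy as np

def kernel_ridge_predict(K_train, y_train, K_test_train, lam, lam_self=0.0):
    """Kernel ridge regression with effective regularization lam + lam_self,
    where lam_self is a placeholder for the 'self-induced' term described
    above.  Sketch only."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + (lam + lam_self) * np.eye(n), y_train)
    return K_test_train @ alpha

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel matrix; an arbitrary stand-in for the infinite-width kernel."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
X_test = rng.normal(size=(5, 3))
pred = kernel_ridge_predict(rbf_kernel(X, X), y, rbf_kernel(X_test, X),
                            lam=1e-2, lam_self=5e-2)
print(pred)
```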
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.