Hidden Progress in Deep Learning: SGD Learns Parities Near the
Computational Limit
- URL: http://arxiv.org/abs/2207.08799v1
- Date: Mon, 18 Jul 2022 17:55:05 GMT
- Title: Hidden Progress in Deep Learning: SGD Learns Parities Near the
Computational Limit
- Authors: Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran
Malach, Cyril Zhang
- Abstract summary: This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits.
We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
- Score: 36.17720004582283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is mounting empirical evidence of emergent phenomena in the
capabilities of deep learning methods as we scale up datasets, model sizes, and
training times. While there are some accounts of how these resources modulate
statistical capacity, far less is known about their effect on the computational
problem of model training. This work conducts such an exploration through the
lens of learning $k$-sparse parities of $n$ bits, a canonical family of
problems which pose theoretical computational barriers. In this setting, we
find that neural networks exhibit surprising phase transitions when scaling up
dataset size and running time. In particular, we demonstrate empirically that
with standard training, a variety of architectures learn sparse parities with
$n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after
$n^{O(k)}$ iterations. These positive results nearly match known SQ lower
bounds, even without an explicit sparsity-promoting prior. We elucidate the
mechanisms of these phenomena with a theoretical analysis: we find that the
phase transition in performance is not due to SGD "stumbling in the dark" until
it finds the hidden set of features (a natural algorithm which also runs in
$n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap
in the population gradient.
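As a concrete illustration of the experimental setup described above, here is a minimal sketch (assuming PyTorch; this is not the authors' code, and the width, learning rate, and step budget are illustrative): a one-hidden-layer ReLU MLP trained with SGD on the $(n, k)$ sparse-parity task, logging test accuracy, which is where the abrupt phase transition would appear.
```python
# Minimal sketch (not the authors' code): train an MLP with SGD on the
# (n, k) sparse-parity task and log accuracy to watch for the late,
# abrupt phase transition described in the abstract.
import torch
import torch.nn as nn

n, k = 30, 3                       # input bits, parity sparsity
S = torch.arange(k)                # hidden support; WLOG the first k coordinates
model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

def batch(m):
    x = torch.randint(0, 2, (m, n)).float() * 2 - 1   # uniform +/-1 bits
    y = x[:, S].prod(dim=1, keepdim=True)             # parity of the k support bits
    return x, y

for step in range(1, 50001):
    x, y = batch(256)
    loss = (1 - model(x) * y).clamp(min=0).mean()     # hinge loss on +/-1 labels
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        xt, yt = batch(4096)
        acc = (model(xt).sign() == yt).float().mean().item()
        print(f"step {step:6d}  loss {loss.item():.3f}  test acc {acc:.3f}")
```
In runs of this kind the accuracy curve typically hovers near chance for many steps before jumping, consistent with the abstract's "hidden progress" picture.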
Related papers
- Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose the second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, while seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
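A hedged sketch of the SSFT bookkeeping (synthetic data and a linear model stand in for the paper's setting; the variable names are illustrative): fit split A, then keep training on split B only, and record the first epoch at which each split-A example flips to misclassified.
```python
# Hedged sketch of the SSFT idea (not the authors' implementation).
import torch
import torch.nn as nn

torch.manual_seed(0)
xa, ya = torch.randn(500, 20), torch.randint(0, 2, (500,))  # split A
xb, yb = torch.randn(500, 20), torch.randint(0, 2, (500,))  # split B
model = nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def epoch(x, y):
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()

for _ in range(200):                      # phase 1: fit split A
    epoch(xa, ya)

ssft = torch.full((500,), -1)             # -1 == never forgotten
for t in range(1, 201):                   # phase 2: train on split B only
    epoch(xb, yb)
    wrong = model(xa).argmax(1) != ya
    ssft[(ssft == -1) & wrong] = t        # first epoch at which each example is forgotten

print("fraction forgotten:", (ssft >= 0).float().mean().item())
```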
arXiv Detail & Related papers (2022-10-26T21:03:46Z)
- Scaling ResNets in the Large-depth Regime [11.374578778690623]
Deep ResNets are recognized for achieving state-of-the-art results in machine learning tasks, but they rely on a training procedure that must be carefully crafted to avoid vanishing or exploding gradients.
No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$.
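A minimal sketch of that scaling strategy, assuming PyTorch; the block structure and the choice $\alpha_L = 1/\sqrt{L}$ are illustrative, not the paper's exact setup.
```python
# Hedged sketch of depth scaling: each residual branch is multiplied by
# alpha_L (e.g. 1/L or 1/sqrt(L)) so activations stay bounded as the
# depth L grows. Not the paper's exact architecture.
import torch
import torch.nn as nn

class ScaledResNet(nn.Module):
    def __init__(self, dim: int, depth: int, alpha: float):
        super().__init__()
        self.alpha = alpha
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = x + self.alpha * block(x)   # x_{l+1} = x_l + alpha_L * f_l(x_l)
        return x

L = 100
net = ScaledResNet(dim=64, depth=L, alpha=L ** -0.5)   # alpha_L = 1/sqrt(L)
out = net(torch.randn(8, 64))
print(out.norm(dim=1))                                 # stays O(1) per sample
```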
arXiv Detail & Related papers (2022-06-14T15:49:10Z)
- A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks [11.461878019780597]
Gradient descent can converge slowly in some deep neural networks.
It remains unclear whether a gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup.
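For reference, a single-machine sketch of the clipped-gradient step at the heart of such schemes (the distributed, communication-efficient part is the paper's contribution and is omitted here; the function name and hyperparameters are illustrative):
```python
# Hedged sketch of a clipped-SGD step: rescale the gradient whenever its
# global norm exceeds a threshold, then take an ordinary SGD step.
import torch

def clipped_sgd_step(params, lr=0.1, clip=1.0):
    # global gradient norm across all parameters
    total = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
    scale = min(1.0, clip / (total.item() + 1e-12))
    with torch.no_grad():
        for p in params:
            p -= lr * scale * p.grad   # update with the clipped gradient
            p.grad = None

w = torch.randn(10, requires_grad=True)
loss = (w ** 2).sum() * 100            # deliberately steep loss surface
loss.backward()
clipped_sgd_step([w])
```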
arXiv Detail & Related papers (2022-05-10T16:55:33Z)
- Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
Quantile regression tends to under-cover relative to the desired coverage level in reality.
We prove that quantile regression suffers from an inherent under-coverage bias.
Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
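A toy sketch of how this under-coverage can be observed empirically (pinball-loss quantile regression on synthetic data; the dimensions, learning rate, and step count are illustrative assumptions, not the paper's setup):
```python
# Hedged sketch: fit a 90% conditional quantile with the pinball loss,
# then check how often fresh responses actually fall below the fit.
import numpy as np

rng = np.random.default_rng(0)
tau, d, n = 0.9, 50, 200                     # target level, dim, sample size
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)

theta, b = np.zeros(d), 0.0
for _ in range(2000):                        # subgradient descent on pinball loss
    r = y - X @ theta - b
    s = tau - (r < 0)                        # pinball subgradient weights
    theta -= 0.1 * (-X.T @ s / n)
    b -= 0.1 * (-s.mean())

Xt = rng.normal(size=(5000, d))
yt = Xt @ np.ones(d) + rng.normal(size=5000)
coverage = (yt <= Xt @ theta + b).mean()
print(f"target {tau:.2f}, empirical coverage {coverage:.3f}")  # often below target when d/n is large
```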
arXiv Detail & Related papers (2021-06-10T06:11:55Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
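A toy probe of this setting (not the paper's construction; the data model and network width are illustrative): a wide two-layer ReLU network pushed to near-zero training error on noisy labels, with test error checked afterwards.
```python
# Hedged illustration of benign overfitting: an overparameterized
# two-layer ReLU network can interpolate noisy training labels and
# still generalize reasonably on fresh data.
import torch
import torch.nn as nn

torch.manual_seed(0)
w_star = torch.randn(20)

def data(m, noise=0.3):
    x = torch.randn(m, 20)
    y = (x @ w_star + noise * torch.randn(m)).sign()
    return x, y

xtr, ytr = data(100)
xte, yte = data(10000)
net = nn.Sequential(nn.Linear(20, 2000), nn.ReLU(), nn.Linear(2000, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(3000):
    out = net(xtr).squeeze(1)
    loss = torch.clamp(1 - out * ytr, min=0).mean()   # hinge loss on +/-1 labels
    opt.zero_grad(); loss.backward(); opt.step()

train_err = (net(xtr).squeeze(1).sign() != ytr).float().mean()
test_err = (net(xte).squeeze(1).sign() != yte).float().mean()
print(f"train error {train_err:.3f}, test error {test_err:.3f}")
```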
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Analysis of feature learning in weight-tied autoencoders via the mean field lens [3.553493344868413]
We analyze a class of two-layer weight-tied nonlinear autoencoders in the mean field framework.
Models trained with gradient descent are shown to admit a mean field limiting dynamics.
Experiments on real-life data demonstrate an interesting match with the theory.
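A minimal sketch of the model class in question, assuming PyTorch; the tied decoder simply reuses the transposed encoder weights, and the data here is a random stand-in.
```python
# Hedged sketch of a two-layer weight-tied autoencoder: the decoder
# shares (the transpose of) the encoder weight matrix, and the model is
# trained by gradient descent on reconstruction error.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightTiedAE(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden, dim) / dim ** 0.5)
        self.b = nn.Parameter(torch.zeros(hidden))

    def forward(self, x):
        h = torch.relu(x @ self.W.T + self.b)   # encoder
        return h @ self.W                       # decoder reuses W (tied weights)

ae = WeightTiedAE(dim=784, hidden=256)
opt = torch.optim.SGD(ae.parameters(), lr=0.1)
x = torch.rand(64, 784)                          # stand-in for real data
for step in range(200):
    loss = F.mse_loss(ae(x), x)
    opt.zero_grad(); loss.backward(); opt.step()
print("reconstruction MSE:", loss.item())
```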
arXiv Detail & Related papers (2021-02-16T18:58:37Z)
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
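A hedged numerical probe of this claim on synthetic linear regression (the eigendirections are those of $X^\top X$; sizes, learning rate, and batch sizes are illustrative, and this toy run is not guaranteed to reproduce the paper's separation):
```python
# Compare full-batch GD and small-batch SGD on anisotropic linear
# regression, measuring the remaining error along the top and bottom
# eigendirections of the data matrix X^T X.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d)) * np.linspace(3.0, 0.1, d)   # anisotropic features
w_star = rng.normal(size=d)
y = X @ w_star
_, _, Vt = np.linalg.svd(X, full_matrices=False)          # eigendirections of X^T X

def run(batch, steps=500, lr=0.01):
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * g
    err = Vt @ (w - w_star)                               # error per eigendirection
    return abs(err[0]), abs(err[-1])                      # top vs bottom direction

print("GD  (top, bottom):", run(batch=n))   # batch == n is full-batch GD
print("SGD (top, bottom):", run(batch=8))
```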
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
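A short sketch of the quantity being studied: the variance of the mini-batch gradient in linear regression, estimated empirically for several batch sizes (the setup is illustrative; for sampling with replacement the variance shrinks like $1/b$):
```python
# Hedged sketch: estimate E||g_b - g||^2, the deviation of the
# mini-batch gradient g_b from the full gradient g, in linear regression.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
w = rng.normal(size=d)                       # fixed point at which we measure
full_grad = X.T @ (X @ w - y) / n

for b in (1, 4, 16, 64, 256):
    grads = []
    for _ in range(2000):
        idx = rng.integers(0, n, size=b)     # sample with replacement
        grads.append(X[idx].T @ (X[idx] @ w - y[idx]) / b)
    var = np.mean([np.sum((g - full_grad) ** 2) for g in grads])
    print(f"batch {b:3d}: E||g_b - g||^2 ~= {var:.3f}")   # roughly proportional to 1/b
```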
arXiv Detail & Related papers (2020-04-27T20:06:11Z)