Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
 - URL: http://arxiv.org/abs/2207.08799v1
 - Date: Mon, 18 Jul 2022 17:55:05 GMT
 - Title: Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
 - Authors: Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang
 - Abstract summary: This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits. We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
 - Score: 36.17720004582283
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract: There is mounting empirical evidence of emergent phenomena in the
capabilities of deep learning methods as we scale up datasets, model sizes, and
training times. While there are some accounts of how these resources modulate
statistical capacity, far less is known about their effect on the computational
problem of model training. This work conducts such an exploration through the
lens of learning $k$-sparse parities of $n$ bits, a canonical family of
problems which pose theoretical computational barriers. In this setting, we
find that neural networks exhibit surprising phase transitions when scaling up
dataset size and running time. In particular, we demonstrate empirically that
with standard training, a variety of architectures learn sparse parities with
$n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after
$n^{O(k)}$ iterations. These positive results nearly match known SQ lower
bounds, even without an explicit sparsity-promoting prior. We elucidate the
mechanisms of these phenomena with a theoretical analysis: we find that the
phase transition in performance is not due to SGD "stumbling in the dark" until
it finds the hidden set of features (a natural algorithm which also runs in
$n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap
in the population gradient.
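
To make the setup concrete, here is a minimal, illustrative sketch of online SGD on the $(n, k)$-sparse parity problem described in the abstract. This is not the authors' code; the architecture, loss, and hyperparameters are assumptions, but the qualitative behavior to look for is the one the paper reports: the loss stays flat for a long stretch and then drops abruptly.

```python
# Minimal sketch of the (n, k)-sparse parity experiment; hyperparameters
# are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

n, k = 30, 3                      # input bits, parity sparsity
support = torch.randperm(n)[:k]   # hidden subset S of k coordinates

def sample_batch(batch_size):
    x = torch.randint(0, 2, (batch_size, n)).float() * 2 - 1  # uniform +/-1 bits
    y = x[:, support].prod(dim=1)                             # parity chi_S(x)
    return x, y

model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()  # squared loss for simplicity; the loss choice is an assumption

for step in range(50_000):
    x, y = sample_batch(256)  # fresh samples each step: online SGD
    loss = loss_fn(model(x).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(step, loss.item())  # flat for many steps, then an abrupt drop
```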
Related papers
- Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks [59.552873049024775]
We show that compute-optimally trained models exhibit a remarkably precise universality.
With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor.
We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws.
arXiv  Detail & Related papers  (2025-07-02T20:03:34Z)
- Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks trained with stochastic gradient descent (SGD), using the tensor program framework.
We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values.
This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv  Detail & Related papers  (2025-03-12T17:33:13Z)
- Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks [12.061229162870513]
We study the training dynamics of two-layer neural networks.
We find several new phenomena in the training dynamics.
These include the emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity.
arXiv  Detail & Related papers  (2025-02-28T17:45:26Z)
- MLPs at the EOC: Dynamics of Feature Learning [8.430481660019451]
We propose a theory to explain the convergence of gradient descent and the learning of features along the way.
Such a theory should also cover phenomena observed by practitioners, including the Edge of Stability (EOS) and the catapult mechanism.
arXiv  Detail & Related papers  (2025-02-18T18:23:33Z)
- Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
arXiv  Detail & Related papers  (2022-10-26T21:03:46Z)
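
A minimal sketch of how a forgetting-time statistic in the spirit of SSFT could be computed from per-epoch correctness records gathered while training on the second split. The function name and protocol details are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def second_split_forgetting_time(correct: np.ndarray) -> np.ndarray:
    """correct[t, i] records whether original training example i is
    classified correctly at epoch t while training on the second split.
    Returns, per example, the first epoch after which it is never
    classified correctly again (np.inf if it is never forgotten)."""
    num_epochs, num_examples = correct.shape
    times = np.full(num_examples, np.inf)
    for i in range(num_examples):
        hits = np.flatnonzero(correct[:, i])   # epochs where example i is correct
        if hits.size == 0:
            times[i] = 0.0                     # never correct on the second split
        elif hits[-1] < num_epochs - 1:
            times[i] = hits[-1] + 1            # misclassified from here onward
    return times
```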
- Scaling ResNets in the Large-depth Regime [11.374578778690623]
Deep ResNets are recognized for achieving state-of-the-art results in machine learning tasks.
Deep ResNets rely on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients.
No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$.
arXiv  Detail & Related papers  (2022-06-14T15:49:10Z)
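
To make the scaling strategy concrete, here is a hedged sketch of a residual block whose branch output is multiplied by a depth-dependent factor $\alpha_L$. The choice $\alpha_L = 1/\sqrt{L}$ below is one commonly analyzed option, an assumption here rather than necessarily the paper's recommendation.

```python
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block computing x + alpha_L * f(x)."""
    def __init__(self, dim: int, alpha_L: float):
        super().__init__()
        self.alpha_L = alpha_L
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.alpha_L * self.f(x)

L, dim = 100, 64
alpha_L = L ** -0.5  # illustrative choice: damps gradient growth with depth
net = nn.Sequential(*[ScaledResidualBlock(dim, alpha_L) for _ in range(L)])
```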
- A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks [11.461878019780597]
Gradient descent might converge slowly in some deep neural networks.
It remains mysterious whether a gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup.
arXiv  Detail & Related papers  (2022-05-10T16:55:33Z)
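
For reference, a minimal single-machine sketch of the global-norm gradient clipping step being studied; the distributed, communication-efficient variant is the paper's contribution and is not reproduced here.

```python
import torch

def clip_and_step(model, opt, loss, max_norm=1.0):
    """One optimizer step with global-norm gradient clipping:
    g <- g * min(1, max_norm / ||g||)."""
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
```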
- Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
Quantile regression tends to under-cover relative to the desired coverage level in practice.
We prove that quantile regression suffers from an inherent under-coverage bias.
Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
arXiv  Detail & Related papers  (2021-06-10T06:11:55Z)
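
The under-coverage phenomenon can be probed empirically: fit a linear quantile regressor with the pinball loss and measure how often fresh labels fall below the predicted quantile. This is a minimal sketch under a linear-model assumption, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, q = 5000, 50, 0.9
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(size=n)

def pinball_subgrad(residual, q):
    # subgradient of sum_i rho_q(y_i - u_i) w.r.t. the predictions u
    return np.where(residual > 0, -q, 1 - q)

w = np.zeros(d)
for _ in range(2000):                     # plain subgradient descent
    w -= 0.5 * X.T @ pinball_subgrad(y - X @ w, q) / n

# Empirical coverage on fresh data; under-coverage means this falls below q.
X_new = rng.normal(size=(n, d))
y_new = X_new @ w_star + rng.normal(size=n)
print(f"target {q}, empirical coverage {np.mean(y_new <= X_new @ w):.3f}")
```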
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv  Detail & Related papers  (2021-06-06T19:08:53Z)
- Analysis of feature learning in weight-tied autoencoders via the mean field lens [3.553493344868413]
We analyze a class of two-layer weight-tied nonlinear autoencoders in the mean field framework.
Models trained with gradient descent are shown to admit a mean field limiting dynamics.
Experiments on real-life data demonstrate an interesting match with the theory.
arXiv  Detail & Related papers  (2021-02-16T18:58:37Z)
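
For concreteness, a minimal sketch of a two-layer weight-tied autoencoder, where the decoder reuses the transposed encoder weights; the nonlinearity and sizes here are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightTiedAutoencoder(nn.Module):
    """x_hat = W^T sigma(W x + b): the decoder shares the encoder's weights."""
    def __init__(self, dim_in: int, dim_hidden: int):
        super().__init__()
        self.enc = nn.Linear(dim_in, dim_hidden)

    def forward(self, x):
        h = torch.relu(self.enc(x))               # encoder
        return F.linear(h, self.enc.weight.t())   # tied decoder applies W^T
```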
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv  Detail & Related papers  (2020-11-09T18:32:22Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv  Detail & Related papers  (2020-11-04T21:07:52Z)
- The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
arXiv  Detail & Related papers  (2020-04-27T20:06:11Z) 
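
This setting can be probed directly: for linear regression, estimate the variance of the mini-batch gradient at a fixed iterate as a function of the batch size $b$. A minimal sketch; the roughly $1/b$ decay it exhibits is the standard behavior that analyses of this kind refine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = np.zeros(d)  # measure gradient noise at a fixed iterate

full_grad = X.T @ (X @ w - y) / n
for b in (1, 4, 16, 64, 256):
    deviations = []
    for _ in range(500):
        idx = rng.choice(n, size=b, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / b   # mini-batch gradient
        deviations.append(np.sum((g - full_grad) ** 2))
    print(f"batch {b:4d}: E||g - grad||^2 ~ {np.mean(deviations):.4f}")
```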
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.