Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
 - URL: http://arxiv.org/abs/2207.08799v1
 - Date: Mon, 18 Jul 2022 17:55:05 GMT
 - Title: Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
 - Authors: Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang
 - Abstract summary: This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits. We find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time.
 - Score: 36.17720004582283
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract: There is mounting empirical evidence of emergent phenomena in the
capabilities of deep learning methods as we scale up datasets, model sizes, and
training times. While there are some accounts of how these resources modulate
statistical capacity, far less is known about their effect on the computational
problem of model training. This work conducts such an exploration through the
lens of learning $k$-sparse parities of $n$ bits, a canonical family of
problems which pose theoretical computational barriers. In this setting, we
find that neural networks exhibit surprising phase transitions when scaling up
dataset size and running time. In particular, we demonstrate empirically that
with standard training, a variety of architectures learn sparse parities with
$n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after
$n^{O(k)}$ iterations. These positive results nearly match known SQ lower
bounds, even without an explicit sparsity-promoting prior. We elucidate the
mechanisms of these phenomena with a theoretical analysis: we find that the
phase transition in performance is not due to SGD "stumbling in the dark" until
it finds the hidden set of features (a natural algorithm which also runs in
$n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap
in the population gradient.
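
To make the setup concrete, here is a minimal, illustrative sketch of online SGD on the $(n, k)$-sparse parity problem described in the abstract. This is not the authors' code; the architecture, loss, and hyperparameters are assumptions, but the qualitative behavior to look for is the one the paper reports: the loss stays flat for a long stretch and then drops abruptly.

```python
# Minimal sketch of the (n, k)-sparse parity experiment; hyperparameters
# are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

n, k = 30, 3                      # input bits, parity sparsity
support = torch.randperm(n)[:k]   # hidden subset S of k coordinates

def sample_batch(batch_size):
    x = torch.randint(0, 2, (batch_size, n)).float() * 2 - 1  # uniform +/-1 bits
    y = x[:, support].prod(dim=1)                             # parity chi_S(x)
    return x, y

model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()  # squared loss for simplicity; the loss choice is an assumption

for step in range(50_000):
    x, y = sample_batch(256)  # fresh samples each step: online SGD
    loss = loss_fn(model(x).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(step, loss.item())  # flat for many steps, then an abrupt drop
```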
Related papers
- Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks [59.552873049024775]
We show that compute-optimally trained models exhibit a remarkably precise universality.
With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor.
We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws.
arXiv  Detail & Related papers  (2025-07-02T20:03:34Z)
- Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks trained with stochastic gradient descent (SGD), using the tensor program framework.
We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values.
This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv  Detail & Related papers  (2025-03-12T17:33:13Z)
- Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks [12.061229162870513]
We study the training dynamics of two-layer neural networks.
We find several new phenomena in the training dynamics.
These include the emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity.
arXiv  Detail & Related papers  (2025-02-28T17:45:26Z)
- MLPs at the EOC: Dynamics of Feature Learning [8.430481660019451]
We propose a theory to explain the convergence of gradient descent and the learning of features along the way.
Such a theory should also cover phenomena observed by practitioners, including the Edge of Stability (EOS) and the catapult mechanism.
arXiv  Detail & Related papers  (2025-02-18T18:23:33Z)
- Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
arXiv  Detail & Related papers  (2022-10-26T21:03:46Z)
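
A minimal sketch of how a forgetting-time statistic in the spirit of SSFT could be computed from per-epoch correctness records gathered while training on the second split. The function name and protocol details are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def second_split_forgetting_time(correct: np.ndarray) -> np.ndarray:
    """correct[t, i] records whether original training example i is
    classified correctly at epoch t while training on the second split.
    Returns, per example, the first epoch after which it is never
    classified correctly again (np.inf if it is never forgotten)."""
    num_epochs, num_examples = correct.shape
    times = np.full(num_examples, np.inf)
    for i in range(num_examples):
        hits = np.flatnonzero(correct[:, i])   # epochs where example i is correct
        if hits.size == 0:
            times[i] = 0.0                     # never correct on the second split
        elif hits[-1] < num_epochs - 1:
            times[i] = hits[-1] + 1            # misclassified from here onward
    return times
```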
- Scaling ResNets in the Large-depth Regime [11.374578778690623]
Deep ResNets are recognized for achieving state-of-the-art results in machine learning tasks.
Deep ResNets rely on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients.
No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$.
arXiv  Detail & Related papers  (2022-06-14T15:49:10Z)
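
To make the scaling strategy concrete, here is a hedged sketch of a residual block whose branch output is multiplied by a depth-dependent factor $\alpha_L$. The choice $\alpha_L = 1/\sqrt{L}$ below is one commonly analyzed option, an assumption here rather than necessarily the paper's recommendation.

```python
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block computing x + alpha_L * f(x)."""
    def __init__(self, dim: int, alpha_L: float):
        super().__init__()
        self.alpha_L = alpha_L
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.alpha_L * self.f(x)

L, dim = 100, 64
alpha_L = L ** -0.5  # illustrative choice: damps gradient growth with depth
net = nn.Sequential(*[ScaledResidualBlock(dim, alpha_L) for _ in range(L)])
```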
- A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks [11.461878019780597]
Gradient descent might converge slowly in some deep neural networks.
It remains mysterious whether a gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup.
arXiv  Detail & Related papers  (2022-05-10T16:55:33Z)
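
For reference, a minimal single-machine sketch of the global-norm gradient clipping step being studied; the distributed, communication-efficient variant is the paper's contribution and is not reproduced here.

```python
import torch

def clip_and_step(model, opt, loss, max_norm=1.0):
    """One optimizer step with global-norm gradient clipping:
    g <- g * min(1, max_norm / ||g||)."""
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
```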
- Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
Quantile regression tends to under-cover relative to the desired coverage level in practice.
We prove that quantile regression suffers from an inherent under-coverage bias.
Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
arXiv  Detail & Related papers  (2021-06-10T06:11:55Z)
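
The under-coverage phenomenon can be probed empirically: fit a linear quantile regressor with the pinball loss and measure how often fresh labels fall below the predicted quantile. This is a minimal sketch under a linear-model assumption, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, q = 5000, 50, 0.9
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(size=n)

def pinball_subgrad(residual, q):
    # subgradient of sum_i rho_q(y_i - u_i) w.r.t. the predictions u
    return np.where(residual > 0, -q, 1 - q)

w = np.zeros(d)
for _ in range(2000):                     # plain subgradient descent
    w -= 0.5 * X.T @ pinball_subgrad(y - X @ w, q) / n

# Empirical coverage on fresh data; under-coverage means this falls below q.
X_new = rng.normal(size=(n, d))
y_new = X_new @ w_star + rng.normal(size=n)
print(f"target {q}, empirical coverage {np.mean(y_new <= X_new @ w):.3f}")
```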
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv  Detail & Related papers  (2021-06-06T19:08:53Z)
- Analysis of feature learning in weight-tied autoencoders via the mean field lens [3.553493344868413]
We analyze a class of two-layer weight-tied nonlinear autoencoders in the mean field framework.
Models trained with gradient descent are shown to admit a mean field limiting dynamics.
Experiments on real-life data demonstrate an interesting match with the theory.
arXiv  Detail & Related papers  (2021-02-16T18:58:37Z)
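
For concreteness, a minimal sketch of a two-layer weight-tied autoencoder, where the decoder reuses the transposed encoder weights; the nonlinearity and sizes here are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightTiedAutoencoder(nn.Module):
    """x_hat = W^T sigma(W x + b): the decoder shares the encoder's weights."""
    def __init__(self, dim_in: int, dim_hidden: int):
        super().__init__()
        self.enc = nn.Linear(dim_in, dim_hidden)

    def forward(self, x):
        h = torch.relu(self.enc(x))               # encoder
        return F.linear(h, self.enc.weight.t())   # tied decoder applies W^T
```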
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv  Detail & Related papers  (2020-11-09T18:32:22Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv  Detail & Related papers  (2020-11-04T21:07:52Z)
- The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
arXiv  Detail & Related papers  (2020-04-27T20:06:11Z) 
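
This setting can be probed directly: for linear regression, estimate the variance of the mini-batch gradient at a fixed iterate as a function of the batch size $b$. A minimal sketch; the roughly $1/b$ decay it exhibits is the standard behavior that analyses of this kind refine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = np.zeros(d)  # measure gradient noise at a fixed iterate

full_grad = X.T @ (X @ w - y) / n
for b in (1, 4, 16, 64, 256):
    deviations = []
    for _ in range(500):
        idx = rng.choice(n, size=b, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / b   # mini-batch gradient
        deviations.append(np.sum((g - full_grad) ** 2))
    print(f"batch {b:4d}: E||g - grad||^2 ~ {np.mean(deviations):.4f}")
```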
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.