Phases of learning dynamics in artificial neural networks: with or
without mislabeled data
- URL: http://arxiv.org/abs/2101.06509v1
- Date: Sat, 16 Jan 2021 19:44:27 GMT
- Title: Phases of learning dynamics in artificial neural networks: with or
without mislabeled data
- Authors: Yu Feng and Yuhai Tu
- Abstract summary: We study the dynamics of stochastic gradient descent (SGD) that drives learning in neural networks.
Without mislabeled data, we find that the SGD learning dynamics transitions from a fast learning phase to a slow exploration phase.
We find that the individual sample losses of the correctly and incorrectly labeled data are most separated during phase II.
- Score: 3.3576886095389296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the tremendous success of deep neural networks in machine learning, the
underlying reason for their superior learning capability remains unclear. Here,
we present a framework based on statistical physics to study dynamics of
stochastic gradient descent (SGD) that drives learning in neural networks. By
using the minibatch gradient ensemble, we construct order parameters to
characterize dynamics of weight updates in SGD. Without mislabeled data, we
find that the SGD learning dynamics transitions from a fast learning phase to a
slow exploration phase, which is associated with large changes in order
parameters that characterize the alignment of SGD gradients and their mean
amplitude. In the case with randomly mislabeled samples, SGD learning dynamics
falls into four distinct phases. The system first finds solutions for the
correctly labeled samples in phase I; it then wanders around these solutions in
phase II until it finds a direction to learn the mislabeled samples during
phase III, after which it finds solutions that satisfy all training samples
during phase IV. Correspondingly, the test error decreases during phase I and
remains low during phase II; however, it increases during phase III and reaches
a high plateau during phase IV. The transitions between different phases can be
understood by changes of order parameters that characterize the alignment of
mean gradients for the correctly and incorrectly labeled samples and their
(relative) strength during learning. We find that individual sample losses for
the two datasets are most separated during phase II, which leads to a cleaning
process to eliminate mislabeled samples for improving generalization.
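The abstract's central construction is a set of order parameters built from a minibatch gradient ensemble: the alignment of the per-minibatch gradients, their mean amplitude, and the alignment between the mean gradients of correctly and incorrectly labeled samples. The PyTorch sketch below illustrates plausible versions of such quantities; the specific ratio used for "alignment", the toy model, and the data shapes are illustrative assumptions, not the paper's exact definitions.

```python
# Hedged sketch: gradient-ensemble "order parameters" for SGD dynamics.
# The paper's precise definitions are not reproduced here; this only illustrates
# (i) the alignment of minibatch gradients, (ii) their mean amplitude, and
# (iii) the alignment between mean gradients of clean vs. mislabeled subsets.
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

def minibatch_gradient(model, loss_fn, x, y):
    """Return the full gradient vector of the loss on one minibatch."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return parameters_to_vector([g.detach() for g in grads])

def ensemble_order_parameters(model, loss_fn, batches):
    """Alignment R = |<g>| / <|g|> and mean amplitude <|g|> over a minibatch ensemble."""
    gs = torch.stack([minibatch_gradient(model, loss_fn, x, y) for x, y in batches])
    mean_g = gs.mean(dim=0)
    mean_amp = gs.norm(dim=1).mean()
    alignment = mean_g.norm() / mean_amp   # near 1: aligned (fast learning); near 0: diffusive
    return alignment.item(), mean_amp.item()

def clean_vs_noisy_alignment(model, loss_fn, clean_batches, noisy_batches):
    """Cosine alignment between the mean gradients of clean and mislabeled samples."""
    g_c = torch.stack([minibatch_gradient(model, loss_fn, x, y) for x, y in clean_batches]).mean(0)
    g_n = torch.stack([minibatch_gradient(model, loss_fn, x, y) for x, y in noisy_batches]).mean(0)
    return torch.nn.functional.cosine_similarity(g_c, g_n, dim=0).item()

# Example usage with a toy model and random data (shapes are arbitrary):
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(32, 20), torch.randint(0, 2, (32,))) for _ in range(8)]
R, A = ensemble_order_parameters(model, loss_fn, batches)
print(f"alignment R = {R:.3f}, mean gradient amplitude = {A:.3f}")
```

In the same spirit, logging individual sample losses along training would let one flag persistently high-loss samples during phase II, where the abstract reports the loss distributions of correctly and incorrectly labeled data are most separated, as a basis for the cleaning step.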
Related papers
- Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words.
We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer.
We prove a novel property of the gradient flow, termed "automatic balancing of gradients", which enables the loss values of different samples to decrease at almost the same rate and further facilitates the proof of near-minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z) - Learning in PINNs: Phase transition, total diffusion, and generalization [1.8802875123957965]
We investigate the learning dynamics of fully-connected neural networks through the lens of the gradient signal-to-noise ratio (SNR).
We identify a third phase, termed "total diffusion".
We explore the information-induced compression phenomenon, pinpointing a significant compression of activations at the total diffusion phase.
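The PINNs entry analyzes training through a gradient signal-to-noise ratio. As a hedged illustration only, one common way to estimate such an SNR from a stack of per-minibatch gradient vectors is sketched below; the paper's precise definition may differ.

```python
# Hypothetical sketch of a gradient signal-to-noise ratio (SNR) estimate.
# Definition assumed here: ||mean gradient|| / (mean deviation of gradients around it);
# the PINN paper's exact SNR definition may differ.
import torch

def gradient_snr(grad_vectors: torch.Tensor) -> float:
    """grad_vectors: (num_batches, num_params) stack of per-minibatch gradient vectors."""
    mean_g = grad_vectors.mean(dim=0)                    # the "signal"
    noise = (grad_vectors - mean_g).norm(dim=1).mean()   # average deviation, the "noise"
    return (mean_g.norm() / (noise + 1e-12)).item()
```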
arXiv Detail & Related papers (2024-03-27T12:10:30Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - From Stability to Chaos: Analyzing Gradient Descent Dynamics in
Quadratic Regression [14.521929085104441]
We investigate the dynamics of gradient descent using large-order constant step-sizes in the context of quadratic regression models.
We delineate five distinct training phases: (1) monotonic, (2) catapult, (3) periodic, (4) chaotic, and (5) divergent.
In particular, we observe that performing an ergodic trajectory averaging stabilizes the test error in non-monotonic (and non-divergent) phases.
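The quadratic-regression entry maps out training phases as a function of a large constant step size and notes that ergodic trajectory averaging stabilizes the test error in the non-monotonic, non-divergent phases. The toy sketch below is not the paper's model: it runs gradient descent on a one-dimensional quartic loss, whose update map already passes from monotone convergence through oscillatory, periodic, and chaotic behavior to divergence as the step size grows, and it reports a running average of the loss along the trajectory as a crude stand-in for trajectory averaging.

```python
# Hypothetical 1-D toy (not the paper's model): gradient descent on the quartic loss
# L(w) = (w^2 - 1)^2 / 4, whose update map w <- w - eta*(w^3 - w) exhibits monotone
# convergence, oscillation, periodic orbits, chaos, and divergence as the constant
# step size eta grows. A trajectory average of the loss is tracked as well.
import numpy as np

def run_gd(eta, w0=0.3, steps=2000):
    w, losses = w0, []
    for _ in range(steps):
        w = w - eta * (w**3 - w)           # gradient step on L(w) = (w^2 - 1)^2 / 4
        if not np.isfinite(w) or abs(w) > 1e6:
            return None, None              # divergent phase
        losses.append((w**2 - 1) ** 2 / 4)
    losses = np.array(losses)
    trajectory_avg = losses[len(losses) // 2 :].mean()   # average over the latter half
    return losses[-1], trajectory_avg

for eta in [0.2, 0.9, 1.3, 1.8, 2.2]:      # increasing constant step sizes
    last, avg = run_gd(eta)
    if last is None:
        print(f"eta={eta:4.2f}: divergent")
    else:
        print(f"eta={eta:4.2f}: final loss {last:.3e}, trajectory-averaged loss {avg:.3e}")
```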
arXiv Detail & Related papers (2023-10-02T22:59:17Z) - Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - The effective noise of Stochastic Gradient Descent [9.645196221785694]
Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology.
We characterize the effective noise of SGD and of a recently-introduced variant, persistent SGD, in a neural network model.
We find that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
arXiv Detail & Related papers (2021-12-20T20:46:19Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Unsupervised machine learning of topological phase transitions from
experimental data [52.77024349608834]
We apply unsupervised machine learning techniques to experimental data from ultracold atoms.
We obtain the topological phase diagram of the Haldane model in a completely unbiased fashion.
Our work provides a benchmark for unsupervised detection of new exotic phases in complex many-body systems.
arXiv Detail & Related papers (2021-01-14T16:38:21Z) - Two-Phase Learning for Overcoming Noisy Labels [16.390094129357774]
We propose a novel two-phase learning method, which automatically transitions its learning phase at the point when the network begins to memorize false-labeled samples.
MORPH significantly outperforms five state-of-the-art methods in terms of test error and training time.
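The two-phase entry above switches its learning phase when the network begins to memorize false-labeled samples. The snippet below is a generic heuristic in that spirit, not MORPH's actual criterion: it watches the fraction of high-loss samples per epoch and flags the epoch at which that fraction starts to shrink, i.e. when the network plausibly begins fitting the (likely mislabeled) hard samples. The loss threshold and the 10% drop rule are arbitrary assumptions.

```python
# Generic heuristic sketch (NOT the MORPH algorithm): detect the onset of
# memorization of false-labeled samples from per-sample training losses.
# Early in training the loss distribution is often bimodal (clean = low loss,
# mislabeled = high loss); once the high-loss fraction starts collapsing,
# memorization has plausibly begun, which can trigger a phase switch.
import numpy as np

def memorization_onset(per_sample_losses_over_time, threshold=0.5):
    """per_sample_losses_over_time: list of 1-D arrays, one array per epoch."""
    high_loss_frac = [np.mean(losses > threshold) for losses in per_sample_losses_over_time]
    for epoch in range(1, len(high_loss_frac)):
        # flag the epoch where the high-loss fraction has dropped noticeably
        if high_loss_frac[epoch] < 0.9 * high_loss_frac[0]:
            return epoch
    return None  # no onset detected yet
```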
arXiv Detail & Related papers (2020-12-08T10:25:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.