How neural networks find generalizable solutions: Self-tuned annealing
in deep learning
- URL: http://arxiv.org/abs/2001.01678v1
- Date: Mon, 6 Jan 2020 17:35:54 GMT
- Title: How neural networks find generalizable solutions: Self-tuned annealing
in deep learning
- Authors: Yu Feng and Yuhai Tu
- Abstract summary: We find a robust inverse relation between the weight variance and the landscape flatness for all SGD-based learning algorithms.
Our study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape.
- Score: 7.372592187197655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm
in deep learning, little is known about how SGD finds generalizable solutions
in the high-dimensional weight space. By analyzing the learning dynamics and
loss function landscape, we discover a robust inverse relation between the
weight variance and the landscape flatness (inverse of curvature) for all
SGD-based learning algorithms. To explain the inverse variance-flatness
relation, we develop a random landscape theory, which shows that the SGD noise
strength (effective temperature) depends inversely on the landscape flatness.
Our study indicates that SGD attains a self-tuned landscape-dependent annealing
strategy to find generalizable solutions at the flat minima of the landscape.
Finally, we demonstrate how these new theoretical insights lead to more
efficient algorithms, e.g., for avoiding catastrophic forgetting.
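To make the central measurement concrete, here is a minimal, hedged sketch (a toy PyTorch setup of my own, not the authors' code): it trains a small MLP with SGD, records the fluctuations of the first-layer weights near a minimum, and compares the per-coordinate weight variance with the Hessian diagonal (curvature, i.e. the inverse of flatness). The paper measures flatness along principal components of the weight fluctuations; plain weight coordinates are used here only to keep the sketch short. A positive correlation between log-variance and log-curvature in the printout would be the toy-scale analogue of the inverse variance-flatness relation.

```python
# Hedged sketch, not the authors' code: train a toy MLP with SGD, collect the
# weight fluctuations near a minimum, and compare the per-coordinate variance
# with the per-coordinate curvature (Hessian diagonal).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

traj = []
for step in range(3000):
    idx = torch.randint(0, X.shape[0], (32,))          # mini-batch -> SGD noise
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step >= 2000:                                    # fluctuations near the minimum
        traj.append(model[0].weight.detach().flatten().clone())

weight_var = torch.stack(traj).var(dim=0)               # variance per weight coordinate

# Hessian diagonal of the full-batch loss w.r.t. the first-layer weights,
# via one Hessian-vector product per coordinate.
w = model[0].weight
grad = torch.autograd.grad(loss_fn(model(X), y), w, create_graph=True)[0].flatten()
hess_diag = torch.stack([
    torch.autograd.grad(grad[i], w, retain_graph=True)[0].flatten()[i]
    for i in range(grad.numel())
])

# The paper's inverse variance-flatness relation says flatter (lower-curvature)
# directions carry *smaller* SGD-induced weight variance, i.e. log-variance and
# log-curvature should be positively correlated, opposite to the equilibrium
# expectation.
mask = hess_diag > 1e-8
corr = torch.corrcoef(torch.stack([weight_var[mask].log(), hess_diag[mask].log()]))[0, 1]
print(f"corr(log weight variance, log curvature) = {corr.item():.3f}")
```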
Related papers
- On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z)
- On the Generalization Capability of Temporal Graph Learning Algorithms: Theoretical Insights and a Simpler Method [59.52204415829695]
Temporal Graph Learning (TGL) has become a prevalent technique across diverse real-world applications.
This paper investigates the generalization ability of different TGL algorithms.
We propose a simplified TGL network, which enjoys a small generalization error, improved overall performance, and lower model complexity.
arXiv Detail & Related papers (2024-02-26T08:22:22Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
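The implicit update referenced in the ISGD entry above can be realized as a proximal (backward-Euler) step. The sketch below is an illustrative toy example under my own assumptions (a stiff quadratic loss, a plain inner gradient solver), not the paper's PINN training code; it shows why an implicit step stays stable at a learning rate where the explicit step diverges.

```python
# Hedged sketch of one implicit (proximal / backward-Euler) gradient step,
# not the paper's implementation.
import torch

def loss_fn(theta):
    return 5.0 * (theta ** 2).sum()                    # stiff toy loss, Hessian = 10 * I

def implicit_step(theta, lr, inner_iters=200, inner_lr=0.01):
    # Solve u = argmin loss_fn(u) + ||u - theta||^2 / (2 * lr), the proximal
    # form of an implicit step, with a few inner gradient iterations.
    u = theta.clone().requires_grad_(True)
    for _ in range(inner_iters):
        prox = loss_fn(u) + ((u - theta) ** 2).sum() / (2 * lr)
        g, = torch.autograd.grad(prox, u)
        u = (u - inner_lr * g).detach().requires_grad_(True)
    return u.detach()

theta = torch.tensor([2.0, -3.0])
for step in range(5):
    # At lr = 0.5, explicit gradient descent diverges here
    # (lr * curvature = 0.5 * 10 = 5 > 2); the implicit step stays stable.
    theta = implicit_step(theta, lr=0.5)
    print(step, [round(v, 4) for v in theta.tolist()])
```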
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
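The entry above turns on a single estimator: take a forward-mode directional derivative along a random tangent applied to the activations, and multiply it back onto that tangent. Below is a hedged sketch of that estimator as I read it (a two-layer toy model, torch.func.jvp from PyTorch >= 2.0, closure-captured weights), not the paper's implementation or its local-loss scaling machinery.

```python
# Hedged sketch of an activity-perturbation forward-gradient estimator
# (my illustration, not the paper's code). Requires torch >= 2.0 (torch.func).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(64, 10)                       # batch of inputs
y = torch.randint(0, 3, (64,))                # 3-class labels
W1 = torch.randn(10, 32) * 0.1
W2 = torch.randn(32, 3) * 0.1

def loss_with_perturbed_activations(eps1, eps2):
    a1 = x @ W1 + eps1                        # layer-1 pre-activations plus perturbation
    a2 = torch.tanh(a1) @ W2 + eps2           # layer-2 pre-activations plus perturbation
    return F.cross_entropy(a2, y)

u1 = torch.randn(64, 32)                      # random activation tangents
u2 = torch.randn(64, 3)
zeros = (torch.zeros(64, 32), torch.zeros(64, 3))
_, dd = torch.func.jvp(loss_with_perturbed_activations, zeros, (u1, u2))

# dd is the directional derivative along (u1, u2); dd * u is an unbiased
# estimate of the activation gradients (E[u u^T] = I for standard normal u).
g_a1_hat = dd * u1
g_a2_hat = dd * u2

# Turn activation-gradient estimates into weight-gradient estimates via the
# inputs of each linear layer.
h = torch.tanh(x @ W1)
g_W1_hat = x.t() @ g_a1_hat                   # shape (10, 32)
g_W2_hat = h.t() @ g_a2_hat                   # shape (32, 3)

# Sanity check against autograd; averaging many tangent samples would make
# the estimates converge to these exact gradients.
W1_req = W1.clone().requires_grad_(True)
W2_req = W2.clone().requires_grad_(True)
F.cross_entropy(torch.tanh(x @ W1_req) @ W2_req, y).backward()
cos = F.cosine_similarity(g_W1_hat.flatten(), W1_req.grad.flatten(), dim=0)
print(f"cosine(single-sample estimate, true grad) = {cos.item():.3f}")
```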
- Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions [5.022507593837554]
Generalization is one of the most important problems in deep learning (DL).
There exist many low-loss solutions that fit the training data equally well.
The key question is which solution is more generalizable.
arXiv Detail & Related papers (2022-06-02T18:49:36Z)
- Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima [10.990447273771592]
We develop a quantitative theory of the escape behavior of the stochastic gradient descent (SGD) algorithm.
We investigate the effect of the sharpness of loss surfaces on SGD's escape from local minima in neural networks.
arXiv Detail & Related papers (2021-11-07T05:00:35Z)
- Learning While Dissipating Information: Understanding the Generalization Capability of SGLD [9.328633662865682]
We derive an algorithm-dependent generalization bound by analyzing stochastic gradient Langevin dynamics (SGLD).
Our analysis reveals an intricate trade-off between learning and information dissipation.
arXiv Detail & Related papers (2021-02-05T03:18:52Z)
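For reference, SGLD augments each stochastic gradient step with Gaussian noise; in the common tempered form the noise variance is 2*lr/beta for inverse temperature beta, so the iterates approximately sample exp(-beta * L) rather than merely descending it. A minimal sketch on a toy regression problem follows (the hyperparameters lr and beta are arbitrary choices of mine, not values from the paper).

```python
# Minimal SGLD sketch on a toy linear-regression loss (not the paper's analysis).
import torch

torch.manual_seed(0)
X = torch.randn(512, 5)
true_w = torch.tensor([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * torch.randn(512)

theta = torch.zeros(5, requires_grad=True)
lr, beta = 1e-3, 1e4                          # step size and inverse temperature (my choices)

for step in range(5000):
    idx = torch.randint(0, X.shape[0], (32,))
    loss = ((X[idx] @ theta - y[idx]) ** 2).mean()
    grad, = torch.autograd.grad(loss, theta)
    with torch.no_grad():
        noise = torch.randn_like(theta) * (2 * lr / beta) ** 0.5   # Langevin noise
        theta += -lr * grad + noise                                # SGLD update

print(theta.detach())                         # hovers around the least-squares solution
```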
- Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian [48.61341260604871]
Stochastic Gradient Descent (SGD) is a key component of the success of deep neural networks (DNNs).
In this paper, we present a different approach by following the eigenvectors of the Hessian, which we call "ridges".
We show both theoretically and experimentally that our method, called Ridge Rider (RR), offers a promising direction for a variety of challenging problems.
arXiv Detail & Related papers (2020-11-12T17:15:09Z)
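To give a concrete feel for "following eigenvectors of the Hessian", here is a toy illustration on a 2-D saddle, under my own simplifications: recompute the Hessian at each step, pick the most negative-curvature eigenvector (a "ridge"), and step along it in the descent direction. The actual Ridge Rider procedure branches over several eigenvectors and tracks them across updates; none of that is reproduced here.

```python
# Toy sketch of following a negative-curvature Hessian eigenvector
# (my illustration, not the Ridge Rider implementation).
import torch

def f(x):                                     # saddle at the origin
    return x[0] ** 2 - x[1] ** 2 + 0.1 * x[0] * x[1]

x = torch.tensor([0.05, 0.0])                 # start near the saddle
step = 0.1
for t in range(20):
    H = torch.autograd.functional.hessian(f, x)
    evals, evecs = torch.linalg.eigh(H)
    ridge = evecs[:, evals.argmin()]          # most negative curvature direction
    g = torch.autograd.functional.jacobian(f, x)
    if torch.dot(g, ridge) > 0:               # orient the ridge so the step descends
        ridge = -ridge
    x = x + step * ridge
print(x, f(x))
```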
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large-eigenvalue directions of the data matrix, while GD converges along the small-eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Anomalous diffusion dynamics of learning in deep neural networks [0.0]
Learning in deep neural networks (DNNs) is implemented through minimizing a highly non-convex loss function.
We present a novel account of how such effective deep learning emerges through the interaction of the learning dynamics with the fractal-like structure of the loss landscape.
arXiv Detail & Related papers (2020-09-22T14:57:59Z)
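One simple way to probe "anomalous diffusion of learning" is to track how far the weight vector wanders from its starting point during training and fit a power-law exponent to the mean squared displacement; alpha = 1 corresponds to ordinary diffusion, alpha != 1 to anomalous diffusion. The sketch below is a toy measurement of that kind (my own task, architecture, and hyperparameters; displacement from the initial weights is used as a crude proxy for the MSD), not the paper's analysis.

```python
# Hedged sketch: fit the exponent alpha in MSD(t) ~ t^alpha for SGD training
# of a small network (toy setup, not the paper's experiments).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] * X[:, 1] > 0).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.05)

w0 = torch.cat([p.detach().flatten().clone() for p in model.parameters()])
times, msd = [], []
for step in range(1, 5001):
    idx = torch.randint(0, X.shape[0], (32,))
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:                         # record squared displacement from start
        w = torch.cat([p.detach().flatten() for p in model.parameters()])
        times.append(step)
        msd.append(((w - w0) ** 2).sum().item())

# Least-squares fit of log MSD = alpha * log t + const.
logt = torch.log(torch.tensor(times, dtype=torch.float))
logm = torch.log(torch.tensor(msd))
A = torch.stack([logt, torch.ones_like(logt)], dim=1)
alpha, _ = torch.linalg.lstsq(A, logm.unsqueeze(1)).solution.flatten().tolist()
print(f"fitted MSD exponent alpha ~ {alpha:.2f}")
```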
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.