How neural networks find generalizable solutions: Self-tuned annealing
in deep learning
- URL: http://arxiv.org/abs/2001.01678v1
- Date: Mon, 6 Jan 2020 17:35:54 GMT
- Title: How neural networks find generalizable solutions: Self-tuned annealing
in deep learning
- Authors: Yu Feng and Yuhai Tu
- Abstract summary: We find a robust inverse relation between the weight variance and the landscape flatness for all SGD-based learning algorithms.
Our study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape.
- Score: 7.372592187197655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm
in deep learning, little is known about how SGD finds generalizable solutions
in the high-dimensional weight space. By analyzing the learning dynamics and
loss function landscape, we discover a robust inverse relation between the
weight variance and the landscape flatness (inverse of curvature) for all
SGD-based learning algorithms. To explain the inverse variance-flatness
relation, we develop a random landscape theory, which shows that the SGD noise
strength (effective temperature) depends inversely on the landscape flatness.
Our study indicates that SGD attains a self-tuned landscape-dependent annealing
strategy to find generalizable solutions at the flat minima of the landscape.
Finally, we demonstrate how these new theoretical insights lead to more
efficient algorithms, e.g., for avoiding catastrophic forgetting.
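To make the central measurement concrete, here is a minimal, hedged sketch (a toy PyTorch setup of my own, not the authors' code): it trains a small MLP with SGD, records the fluctuations of the first-layer weights near a minimum, and compares the per-coordinate weight variance with the Hessian diagonal (curvature, i.e. the inverse of flatness). The paper measures flatness along principal components of the weight fluctuations; plain weight coordinates are used here only to keep the sketch short. A positive correlation between log-variance and log-curvature in the printout would be the toy-scale analogue of the inverse variance-flatness relation.

```python
# Hedged sketch, not the authors' code: train a toy MLP with SGD, collect the
# weight fluctuations near a minimum, and compare the per-coordinate variance
# with the per-coordinate curvature (Hessian diagonal).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

traj = []
for step in range(3000):
    idx = torch.randint(0, X.shape[0], (32,))          # mini-batch -> SGD noise
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step >= 2000:                                    # fluctuations near the minimum
        traj.append(model[0].weight.detach().flatten().clone())

weight_var = torch.stack(traj).var(dim=0)               # variance per weight coordinate

# Hessian diagonal of the full-batch loss w.r.t. the first-layer weights,
# via one Hessian-vector product per coordinate.
w = model[0].weight
grad = torch.autograd.grad(loss_fn(model(X), y), w, create_graph=True)[0].flatten()
hess_diag = torch.stack([
    torch.autograd.grad(grad[i], w, retain_graph=True)[0].flatten()[i]
    for i in range(grad.numel())
])

# The paper's inverse variance-flatness relation says flatter (lower-curvature)
# directions carry *smaller* SGD-induced weight variance, i.e. log-variance and
# log-curvature should be positively correlated, opposite to the equilibrium
# expectation.
mask = hess_diag > 1e-8
corr = torch.corrcoef(torch.stack([weight_var[mask].log(), hess_diag[mask].log()]))[0, 1]
print(f"corr(log weight variance, log curvature) = {corr.item():.3f}")
```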
Related papers
- On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z)
- On the Generalization Capability of Temporal Graph Learning Algorithms: Theoretical Insights and a Simpler Method [59.52204415829695]
Temporal Graph Learning (TGL) has become a prevalent technique across diverse real-world applications.
This paper investigates the generalization ability of different TGL algorithms.
We propose a simplified TGL network, which enjoys a small generalization error, improved overall performance, and lower model complexity.
arXiv Detail & Related papers (2024-02-26T08:22:22Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
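The implicit update referenced in the ISGD entry above can be realized as a proximal (backward-Euler) step. The sketch below is an illustrative toy example under my own assumptions (a stiff quadratic loss, a plain inner gradient solver), not the paper's PINN training code; it shows why an implicit step stays stable at a learning rate where the explicit step diverges.

```python
# Hedged sketch of one implicit (proximal / backward-Euler) gradient step,
# not the paper's implementation.
import torch

def loss_fn(theta):
    return 5.0 * (theta ** 2).sum()                    # stiff toy loss, Hessian = 10 * I

def implicit_step(theta, lr, inner_iters=200, inner_lr=0.01):
    # Solve u = argmin loss_fn(u) + ||u - theta||^2 / (2 * lr), the proximal
    # form of an implicit step, with a few inner gradient iterations.
    u = theta.clone().requires_grad_(True)
    for _ in range(inner_iters):
        prox = loss_fn(u) + ((u - theta) ** 2).sum() / (2 * lr)
        g, = torch.autograd.grad(prox, u)
        u = (u - inner_lr * g).detach().requires_grad_(True)
    return u.detach()

theta = torch.tensor([2.0, -3.0])
for step in range(5):
    # At lr = 0.5, explicit gradient descent diverges here
    # (lr * curvature = 0.5 * 10 = 5 > 2); the implicit step stays stable.
    theta = implicit_step(theta, lr=0.5)
    print(step, [round(v, 4) for v in theta.tolist()])
```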
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
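The entry above turns on a single estimator: take a forward-mode directional derivative along a random tangent applied to the activations, and multiply it back onto that tangent. Below is a hedged sketch of that estimator as I read it (a two-layer toy model, torch.func.jvp from PyTorch >= 2.0, closure-captured weights), not the paper's implementation or its local-loss scaling machinery.

```python
# Hedged sketch of an activity-perturbation forward-gradient estimator
# (my illustration, not the paper's code). Requires torch >= 2.0 (torch.func).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(64, 10)                       # batch of inputs
y = torch.randint(0, 3, (64,))                # 3-class labels
W1 = torch.randn(10, 32) * 0.1
W2 = torch.randn(32, 3) * 0.1

def loss_with_perturbed_activations(eps1, eps2):
    a1 = x @ W1 + eps1                        # layer-1 pre-activations plus perturbation
    a2 = torch.tanh(a1) @ W2 + eps2           # layer-2 pre-activations plus perturbation
    return F.cross_entropy(a2, y)

u1 = torch.randn(64, 32)                      # random activation tangents
u2 = torch.randn(64, 3)
zeros = (torch.zeros(64, 32), torch.zeros(64, 3))
_, dd = torch.func.jvp(loss_with_perturbed_activations, zeros, (u1, u2))

# dd is the directional derivative along (u1, u2); dd * u is an unbiased
# estimate of the activation gradients (E[u u^T] = I for standard normal u).
g_a1_hat = dd * u1
g_a2_hat = dd * u2

# Turn activation-gradient estimates into weight-gradient estimates via the
# inputs of each linear layer.
h = torch.tanh(x @ W1)
g_W1_hat = x.t() @ g_a1_hat                   # shape (10, 32)
g_W2_hat = h.t() @ g_a2_hat                   # shape (32, 3)

# Sanity check against autograd; averaging many tangent samples would make
# the estimates converge to these exact gradients.
W1_req = W1.clone().requires_grad_(True)
W2_req = W2.clone().requires_grad_(True)
F.cross_entropy(torch.tanh(x @ W1_req) @ W2_req, y).backward()
cos = F.cosine_similarity(g_W1_hat.flatten(), W1_req.grad.flatten(), dim=0)
print(f"cosine(single-sample estimate, true grad) = {cos.item():.3f}")
```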
- Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions [5.022507593837554]
Generalization is one of the most important problems in deep learning (DL).
There exist many low-loss solutions that fit the training data equally well.
The key question is which solution is more generalizable.
arXiv Detail & Related papers (2022-06-02T18:49:36Z)
- Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima [10.990447273771592]
We develop a quantitative theory of the escape behavior of the stochastic gradient descent (SGD) algorithm.
We investigate the effect of the sharpness of loss surfaces on SGD's escape from local minima in neural networks.
arXiv Detail & Related papers (2021-11-07T05:00:35Z)
- Learning While Dissipating Information: Understanding the Generalization Capability of SGLD [9.328633662865682]
We derive an algorithm-dependent generalization bound by analyzing stochastic gradient Langevin dynamics (SGLD).
Our analysis reveals an intricate trade-off between learning and information dissipation.
arXiv Detail & Related papers (2021-02-05T03:18:52Z)
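For reference, SGLD augments each stochastic gradient step with Gaussian noise; in the common tempered form the noise variance is 2*lr/beta for inverse temperature beta, so the iterates approximately sample exp(-beta * L) rather than merely descending it. A minimal sketch on a toy regression problem follows (the hyperparameters lr and beta are arbitrary choices of mine, not values from the paper).

```python
# Minimal SGLD sketch on a toy linear-regression loss (not the paper's analysis).
import torch

torch.manual_seed(0)
X = torch.randn(512, 5)
true_w = torch.tensor([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * torch.randn(512)

theta = torch.zeros(5, requires_grad=True)
lr, beta = 1e-3, 1e4                          # step size and inverse temperature (my choices)

for step in range(5000):
    idx = torch.randint(0, X.shape[0], (32,))
    loss = ((X[idx] @ theta - y[idx]) ** 2).mean()
    grad, = torch.autograd.grad(loss, theta)
    with torch.no_grad():
        noise = torch.randn_like(theta) * (2 * lr / beta) ** 0.5   # Langevin noise
        theta += -lr * grad + noise                                # SGLD update

print(theta.detach())                         # hovers around the least-squares solution
```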
- Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian [48.61341260604871]
Stochastic Gradient Descent (SGD) is a key component of the success of deep neural networks (DNNs).
In this paper, we present a different approach by following the eigenvectors of the Hessian, which we call "ridges".
We show both theoretically and experimentally that our method, called Ridge Rider (RR), offers a promising direction for a variety of challenging problems.
arXiv Detail & Related papers (2020-11-12T17:15:09Z)
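To give a concrete feel for "following eigenvectors of the Hessian", here is a toy illustration on a 2-D saddle, under my own simplifications: recompute the Hessian at each step, pick the most negative-curvature eigenvector (a "ridge"), and step along it in the descent direction. The actual Ridge Rider procedure branches over several eigenvectors and tracks them across updates; none of that is reproduced here.

```python
# Toy sketch of following a negative-curvature Hessian eigenvector
# (my illustration, not the Ridge Rider implementation).
import torch

def f(x):                                     # saddle at the origin
    return x[0] ** 2 - x[1] ** 2 + 0.1 * x[0] * x[1]

x = torch.tensor([0.05, 0.0])                 # start near the saddle
step = 0.1
for t in range(20):
    H = torch.autograd.functional.hessian(f, x)
    evals, evecs = torch.linalg.eigh(H)
    ridge = evecs[:, evals.argmin()]          # most negative curvature direction
    g = torch.autograd.functional.jacobian(f, x)
    if torch.dot(g, ridge) > 0:               # orient the ridge so the step descends
        ridge = -ridge
    x = x + step * ridge
print(x, f(x))
```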
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large-eigenvalue directions of the data matrix, while GD converges along the small-eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Anomalous diffusion dynamics of learning in deep neural networks [0.0]
Learning in deep neural networks (DNNs) is implemented through minimizing a highly non-convex loss function.
We present a novel account of how such effective deep learning emerges through the interaction of the learning dynamics with the fractal-like structure of the loss landscape.
arXiv Detail & Related papers (2020-09-22T14:57:59Z)
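One simple way to probe "anomalous diffusion of learning" is to track how far the weight vector wanders from its starting point during training and fit a power-law exponent to the mean squared displacement; alpha = 1 corresponds to ordinary diffusion, alpha != 1 to anomalous diffusion. The sketch below is a toy measurement of that kind (my own task, architecture, and hyperparameters; displacement from the initial weights is used as a crude proxy for the MSD), not the paper's analysis.

```python
# Hedged sketch: fit the exponent alpha in MSD(t) ~ t^alpha for SGD training
# of a small network (toy setup, not the paper's experiments).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] * X[:, 1] > 0).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.05)

w0 = torch.cat([p.detach().flatten().clone() for p in model.parameters()])
times, msd = [], []
for step in range(1, 5001):
    idx = torch.randint(0, X.shape[0], (32,))
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:                         # record squared displacement from start
        w = torch.cat([p.detach().flatten() for p in model.parameters()])
        times.append(step)
        msd.append(((w - w0) ** 2).sum().item())

# Least-squares fit of log MSD = alpha * log t + const.
logt = torch.log(torch.tensor(times, dtype=torch.float))
logm = torch.log(torch.tensor(msd))
A = torch.stack([logt, torch.ones_like(logt)], dim=1)
alpha, _ = torch.linalg.lstsq(A, logm.unsqueeze(1)).solution.flatten().tolist()
print(f"fitted MSD exponent alpha ~ {alpha:.2f}")
```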
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.