Unveiling the structure of wide flat minima in neural networks
- URL: http://arxiv.org/abs/2107.01163v1
- Date: Fri, 2 Jul 2021 16:04:57 GMT
- Title: Unveiling the structure of wide flat minima in neural networks
- Authors: Carlo Baldassi, Clarissa Lauditi, Enrico M. Malatesta, Gabriele
Perugini, Riccardo Zecchina
- Abstract summary: The success of deep learning has revealed the application potential of neural networks across the sciences.
- Score: 0.46664938579243564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep learning has revealed the application potential of neural
networks across the sciences and opened up fundamental theoretical problems. In
particular, the fact that learning algorithms based on simple variants of
gradient methods are able to find near-optimal minima of highly nonconvex loss
functions is an unexpected feature of neural networks which needs to be
understood in depth. Such algorithms are able to fit the data almost perfectly,
even in the presence of noise, and yet they have excellent predictive
capabilities. Several empirical results have shown a reproducible correlation
between the so-called flatness of the minima achieved by the algorithms and the
generalization performance. At the same time, statistical physics results have
shown that in nonconvex networks a multitude of narrow minima may coexist with
a much smaller number of wide flat minima, which generalize well. Here we show
that wide flat minima arise from the coalescence of minima that correspond to
high-margin classifications. Despite being exponentially rare compared to
zero-margin solutions, high-margin minima tend to concentrate in particular
regions. These minima are in turn surrounded by other solutions of smaller and
smaller margin, leading to dense regions of solutions over long distances. Our
analysis also provides an alternative analytical method for estimating when
flat minima appear and when algorithms begin to find solutions, as the number
of model parameters varies.
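As a rough illustration of the margin-flatness link described in the abstract (a toy sketch with made-up sizes, margins and perturbation scales, not the paper's replica calculation), one can train a spherical perceptron on random patterns at two different margins and probe local flatness by counting how many random weight perturbations keep the training error at zero:

```python
# Toy probe of the claim that higher-margin solutions sit in flatter regions.
# N, P, kappa, sigma and the hinge-loss training loop are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 100                                  # weights and random patterns (alpha = P/N = 0.5)
X = rng.choice([-1.0, 1.0], size=(P, N))
y = rng.choice([-1.0, 1.0], size=P)

def train(kappa, steps=20000, lr=0.05):
    """Gradient descent on a hinge loss enforcing margin kappa, on the sphere |w|^2 = N."""
    w = rng.normal(size=N)
    w *= np.sqrt(N) / np.linalg.norm(w)
    for _ in range(steps):
        stab = y * (X @ w) / np.sqrt(N)          # per-pattern stabilities (margins)
        viol = stab < kappa
        if not viol.any():                       # every pattern reaches margin kappa
            break
        w += lr * (y[viol, None] * X[viol]).sum(axis=0) / np.sqrt(N)
        w *= np.sqrt(N) / np.linalg.norm(w)      # project back onto the sphere
    return w

def flatness_proxy(w, sigma=0.2, trials=500):
    """Fraction of random perturbations of relative size sigma that keep zero training error."""
    hits = 0
    for _ in range(trials):
        v = w + sigma * rng.normal(size=N)       # |w_i| is O(1), so sigma is a relative scale
        hits += np.all(y * (X @ v) > 0)
    return hits / trials

for kappa in (0.0, 0.5):
    w = train(kappa)
    print(f"margin kappa = {kappa}: flatness proxy = {flatness_proxy(w):.2f}")
```

In this sketch, the high-margin solution typically survives far more perturbations than the zero-margin one, mirroring the abstract's picture of high-margin minima sitting inside wide, dense regions of solutions.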
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable by our training procedure, including the gradient-based optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - How to escape sharp minima with random perturbations [48.095392390925745]
We study the notion of flat minima and the complexity of finding them.
For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently.
For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization.
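A one-dimensional cartoon of the random-perturbation idea (an illustrative sketch, not the paper's algorithm or its guarantees): run gradient descent and, whenever the gradient is nearly zero, add a random kick of fixed radius; the kick throws the iterate out of a sufficiently sharp basin but not out of a wide one. The loss, learning rate, kick radius and tolerance below are all made-up choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gradient of a toy loss with a very sharp minimum near x = -1 and a wide, flat one near x = +2:
# f(x) = 2 - exp(-200 (x + 1)^2) - exp(-0.5 (x - 2)^2)
def grad(x):
    return (400.0 * (x + 1.0) * np.exp(-200.0 * (x + 1.0) ** 2)
            + (x - 2.0) * np.exp(-0.5 * (x - 2.0) ** 2))

def perturbed_gd(x, lr=2e-3, radius=0.3, tol=1e-3, steps=20000):
    """Plain gradient descent plus a random kick whenever the iterate is nearly stationary."""
    for _ in range(steps):
        g = grad(x)
        if abs(g) < tol:
            x += rng.uniform(-radius, radius)    # random perturbation near stationarity
        else:
            x -= lr * g
    return x

print(perturbed_gd(x=-1.0))                      # starts in the sharp basin, typically ends near x = 2
```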
arXiv Detail & Related papers (2023-05-25T02:12:33Z) - Typical and atypical solutions in non-convex neural networks with
discrete and continuous weights [2.7127628066830414]
We study the binary and continuous negative-margin perceptrons as simple non-convex neural network models learning random rules and associations.
Both models exhibit subdominant minimizers which are extremely flat and wide.
For both models, the generalization performance as a learning device is shown to be greatly improved by the existence of wide flat minimizers.
arXiv Detail & Related papers (2023-04-26T23:34:40Z) - Questions for Flat-Minima Optimization of Modern Neural Networks [28.12506392321345]
Two families of methods for finding flat minima stand out: 1. averaging methods (e.g. Stochastic Weight Averaging, SWA) and 2. minimax methods (e.g. Sharpness-Aware Minimization, SAM).
We investigate the loss surfaces from a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks.
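For concreteness, bare-bones numpy sketches of the two families are given below (illustrative update rules only, not the benchmarked implementations); `loss_grad(w, batch)` is an assumed user-supplied gradient oracle.

```python
import numpy as np

def swa(w, batches, loss_grad, lr=0.1, avg_start=100):
    """Stochastic Weight Averaging: average the tail of the SGD trajectory."""
    w_avg, n_avg = np.zeros_like(w), 0
    for t, batch in enumerate(batches):
        w = w - lr * loss_grad(w, batch)
        if t >= avg_start:
            w_avg = (n_avg * w_avg + w) / (n_avg + 1)
            n_avg += 1
    return w_avg if n_avg else w

def sam_step(w, batch, loss_grad, lr=0.1, rho=0.05):
    """Sharpness-Aware Minimization: descend using the gradient at a worst-case nearby point."""
    g = loss_grad(w, batch)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # first-order ascent to the local worst case
    return w - lr * loss_grad(w + eps, batch)    # descent step with the perturbed gradient
```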
arXiv Detail & Related papers (2022-02-01T18:56:15Z) - Neighborhood Region Smoothing Regularization for Finding Flat Minima In
Deep Neural Networks [16.4654807047138]
We propose an effective regularization technique, called Neighborhood Region Smoothing (NRS)
NRS tries to regularize a neighborhood region in weight space so that the models in it yield approximately the same outputs.
We empirically show that the minima found by NRS would have relatively smaller Hessian eigenvalues compared to the conventional method.
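A rough reading of that idea in code (a sketch under our interpretation, not the authors' implementation): penalize the divergence between the model's predictions at the current weights and at a randomly perturbed neighbor, so that the whole neighborhood yields approximately the same outputs. The function `model_logits(w, x)`, the perturbation radius and the symmetric-KL penalty are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neighborhood_smoothing_penalty(w, x, model_logits, radius=0.01, rng=None):
    """Symmetric KL between predictions at w and at a random neighbor of w."""
    if rng is None:
        rng = np.random.default_rng(0)
    w_neighbor = w + radius * rng.normal(size=w.shape)
    p = softmax(model_logits(w, x))              # predictions at the current weights
    q = softmax(model_logits(w_neighbor, x))     # predictions at a nearby point in weight space
    kl_pq = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12)), axis=-1)
    return float(np.mean(kl_pq + kl_qp))

# training objective (sketch): task_loss(w, x, y) + lam * neighborhood_smoothing_penalty(w, x, model_logits)
```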
arXiv Detail & Related papers (2022-01-16T15:11:00Z) - Learning through atypical ''phase transitions'' in overparameterized
neural networks [0.43496401697112685]
Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear.
Yet they can fit data almost perfectly through variants of gradient descent algorithms and achieve unexpected levels of prediction accuracy.
These results pose formidable conceptual challenges for understanding generalization.
arXiv Detail & Related papers (2021-10-01T23:28:07Z) - Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Minimax
Problems [80.46370778277186]
Large scale convex-concave minimax problems arise in numerous applications, including game theory, robust training, and training of generative adversarial networks.
We develop a communication-efficient distributed extragradient algorithm, LocalAdaSient, with an adaptive learning rate, suitable for solving convex-concave minimax problems in the Parameter-Server model.
We demonstrate its efficacy through several experiments in both the homogeneous and heterogeneous settings.
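The single-worker primitive underlying such methods is the extragradient step; a minimal sketch on a toy bilinear saddle point problem (not the paper's distributed, adaptive algorithm) is shown below.

```python
# Extragradient on min_x max_y f(x, y) = x * y, whose saddle point is (0, 0).
# Plain simultaneous gradient descent/ascent cycles or diverges here; extragradient converges.
def extragradient(x, y, lr=0.2, steps=500):
    for _ in range(steps):
        # half step to a look-ahead point
        x_half = x - lr * y                      # df/dx = y
        y_half = y + lr * x                      # df/dy = x
        # full step using the gradients at the look-ahead point
        x, y = x - lr * y_half, y + lr * x_half
    return x, y

print(extragradient(1.0, 1.0))                   # approaches the saddle point (0, 0)
```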
arXiv Detail & Related papers (2021-06-18T09:42:05Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Entropic gradient descent algorithms and wide flat minima [6.485776570966397]
We show analytically that there exist Bayes optimal pointwise estimators which correspond to minimizers belonging to wide flat regions.
We extend the analysis to the deep learning scenario by extensive numerical validations.
An easy to compute flatness measure shows a clear correlation with test accuracy.
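The paper's exact flatness measure is not reproduced here, but a proxy in the same easy-to-compute spirit is sketched below (an assumption, not the paper's definition): the average increase in training loss when the trained weights are hit by random relative perturbations, with `loss(w)` an assumed user-supplied evaluation function.

```python
import numpy as np

def flatness_profile(w, loss, sigmas=(0.01, 0.05, 0.1), trials=20, seed=0):
    """Mean training-loss increase under multiplicative Gaussian perturbations of strength sigma."""
    rng = np.random.default_rng(seed)
    base = loss(w)
    profile = {}
    for sigma in sigmas:
        deltas = [loss(w * (1.0 + sigma * rng.normal(size=w.shape))) - base
                  for _ in range(trials)]
        profile[sigma] = float(np.mean(deltas))  # smaller values indicate a flatter minimum
    return profile
```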
arXiv Detail & Related papers (2020-06-14T13:22:19Z) - Second-Order Guarantees in Centralized, Federated and Decentralized
Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification by studying the structure of non-convex optimization problems.
A key insight in these analyses is that gradient perturbations play a critical role in allowing local descent algorithms to escape undesirable stationary points.
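A two-variable toy run makes the role of perturbations concrete (an illustration of the general insight, not of the algorithms analyzed in the paper): on f(x, y) = x^2 + (y^2 - 1)^2 / 4, plain gradient descent started at (1, 0) converges to the saddle point at the origin, while a single small random kick near stationarity lets it reach a genuine minimum at (0, +1) or (0, -1).

```python
import numpy as np

rng = np.random.default_rng(2)

def grad(z):                                     # gradient of f(x, y) = x^2 + (y^2 - 1)^2 / 4
    x, y = z
    return np.array([2.0 * x, y ** 3 - y])

def descend(z, perturb, lr=0.05, steps=2000, tol=1e-6):
    z = np.array(z, dtype=float)
    for _ in range(steps):
        g = grad(z)
        if perturb and np.linalg.norm(g) < tol:
            z += 1e-3 * rng.normal(size=2)       # small random kick at a stationary point
        else:
            z -= lr * g
    return z

print(descend((1.0, 0.0), perturb=False))        # stalls at the saddle point (0, 0)
print(descend((1.0, 0.0), perturb=True))         # escapes to a minimum near (0, +/-1)
```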
arXiv Detail & Related papers (2020-03-31T16:54:22Z) - A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient
Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
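A cartoon of this effect with isotropic Gaussian gradient noise (far cruder than the paper's anisotropic diffusion analysis, with made-up constants): on a one-dimensional double-well loss whose two minima have equal depth but very different curvature, noisy gradient steps spend most of their time in the wide, flat basin.

```python
import math, random

random.seed(0)

def grad(x):
    # loss U(x) = (1 - exp(-50 (x+1)^2)) + (1 - exp(-2 (x-1)^2)): sharp well at -1, flat well at +1
    return (100.0 * (x + 1.0) * math.exp(-50.0 * (x + 1.0) ** 2)
            + 4.0 * (x - 1.0) * math.exp(-2.0 * (x - 1.0) ** 2))

lr, temp, steps = 0.01, 0.3, 500_000
noise = math.sqrt(2.0 * lr * temp)               # strength of the injected gradient noise
x, in_flat = -1.0, 0                             # start inside the sharp minimum
for _ in range(steps):
    x += -lr * grad(x) + noise * random.gauss(0.0, 1.0)
    in_flat += x > 0.0
print(f"fraction of time spent in the flat basin: {in_flat / steps:.2f}")  # typically well above 0.5
```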
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.