Which Minimizer Does My Neural Network Converge To?
- URL: http://arxiv.org/abs/2011.02408v2
- Date: Thu, 30 Jun 2022 08:34:56 GMT
- Title: Which Minimizer Does My Neural Network Converge To?
- Authors: Manuel Nonnenmacher, David Reeb, Ingo Steinwart
- Abstract summary: We explain how common variants of the standard NN training procedure change the minimizer obtained.
We show that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer.
This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer.
- Score: 5.575448433529451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The loss surface of an overparameterized neural network (NN) possesses many
global minima of zero training error. We explain how common variants of the
standard NN training procedure change the minimizer obtained. First, we make
explicit how the size of the initialization of a strongly overparameterized NN
affects the minimizer and can deteriorate its final test performance. We
propose a strategy to limit this effect. Then, we demonstrate that for adaptive
optimization such as AdaGrad, the obtained minimizer generally differs from the
gradient descent (GD) minimizer. This adaptive minimizer is changed further by
stochastic mini-batch training, even though in the non-adaptive case, GD and
stochastic GD result in essentially the same minimizer. Lastly, we explain that
these effects remain relevant for less overparameterized NNs. While
overparameterization has its benefits, our work highlights that it induces
sources of error absent from underparameterized models.
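To make the AdaGrad-vs-GD claim concrete, here is a minimal NumPy sketch (not from the paper; the problem size, learning rates, and step counts are illustrative choices) on an overparameterized linear least-squares problem. Started from zero initialization, full-batch GD stays in the row space of the data and recovers the minimum-norm interpolator, while AdaGrad's per-coordinate rescaling typically lands on a different zero-loss minimizer.

```python
import numpy as np

# Toy overparameterized least-squares problem: fewer samples than parameters,
# so the training loss has infinitely many zero-error minimizers.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def grad(w):
    return X.T @ (X @ w - y) / n

def run_gd(lr=1e-2, steps=20_000):
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adagrad(lr=1e-2, steps=20_000, eps=1e-8):
    w, G = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        g = grad(w)
        G += g**2                                # per-coordinate accumulator
        w -= lr * g / (np.sqrt(G) + eps)         # AdaGrad rescaling
    return w

w_gd, w_ada = run_gd(), run_adagrad()
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # minimum-l2-norm interpolator

print("train MSE (GD)     :", np.mean((X @ w_gd - y) ** 2))   # ~ 0
print("train MSE (AdaGrad):", np.mean((X @ w_ada - y) ** 2))  # ~ 0
print("dist(GD, min-norm)     :", np.linalg.norm(w_gd - w_min_norm))   # ~ 0
print("dist(AdaGrad, min-norm):", np.linalg.norm(w_ada - w_min_norm))  # > 0: a different minimizer
```

Both runs reach (numerically) zero training error, but only the GD iterate coincides with the minimum-norm solution; the per-coordinate preconditioning pushes AdaGrad's iterates out of the row space of X, so it interpolates the data with a different parameter vector.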
Related papers
- Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification [53.727688136434345]
Graph Neural Networks (GNNs) have shown superior performance in node classification.
We present Fast Graph Sharpness-Aware Minimization (FGSAM) that integrates the rapid training of Multi-Layer Perceptrons with the superior performance of GNNs.
Our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks.
arXiv Detail & Related papers (2024-10-22T09:33:29Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions reachable by our training procedure, including the optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- A Universal Class of Sharpness-Aware Minimization Algorithms [57.29207151446387]
We introduce a new class of sharpness measures, leading to new sharpness-aware objective functions.
We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate choices of hyperparameters.
arXiv Detail & Related papers (2024-06-06T01:52:09Z)
- Adaptive Self-supervision Algorithms for Physics-informed Neural Networks [59.822151945132525]
Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function.
We study the impact of the location of the collocation points on the trainability of these models.
We propose a novel adaptive collocation scheme which progressively allocates more collocation points to areas where the model is making higher errors.
arXiv Detail & Related papers (2022-07-08T18:17:06Z)
- Sharpness-Aware Training for Free [163.1248341911413]
Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error.
Sharpness-Aware Training for Free (SAF) mitigates the sharp landscape at almost zero additional computational cost over the base optimizer.
SAF ensures convergence to a flat minimum with improved generalization capabilities (a minimal sketch of the underlying SAM update appears after this list).
arXiv Detail & Related papers (2022-05-27T16:32:43Z)
- Minimum Variance Unbiased N:M Sparsity for the Neural Gradients [29.555643722721882]
In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix Multiply (GEMM) by up to 2x.
We examine how this method can also be used for the neural gradients.
arXiv Detail & Related papers (2022-03-21T13:59:43Z)
- On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features [38.05002597295796]
An intriguing empirical phenomenon, known as Neural Collapse, has been widely observed in the last-layer classifiers and features of deep neural networks trained for classification tasks: they collapse to the vertices of a Simplex Equiangular Tight Frame (ETF).
arXiv Detail & Related papers (2022-03-02T17:00:18Z)
- BN-invariant sharpness regularizes the training model to better generalization [72.97766238317081]
We propose a measure of sharpness, BN-Sharpness, which gives a consistent value for networks that are equivalent under BN.
We use the BN-sharpness to regularize the training and design an algorithm to minimize the new regularized objective.
arXiv Detail & Related papers (2021-01-08T10:23:24Z)
- The Effects of Mild Over-parameterization on the Optimization Landscape of Shallow ReLU Neural Networks [36.35321290763711]
We prove that the objective is strongly convex around the global minima when the teacher and student networks possess the same number of neurons.
For the non-global minima, we prove that adding even just a single neuron will turn a non-global minimum into a saddle point.
arXiv Detail & Related papers (2020-06-01T15:13:15Z)
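Several entries above build on the SAM update rule. Below is a minimal NumPy sketch of plain SAM on a toy least-squares model (the problem and hyperparameters are illustrative choices, and this is the standard two-pass SAM rather than the "for free" SAF variant, which avoids the extra gradient computation).

```python
import numpy as np

# Toy regression problem; w is the parameter vector being trained.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 10))
y = rng.normal(size=64)
w = 0.1 * rng.normal(size=10)

def loss_and_grad(w):
    r = X @ w - y
    return np.mean(r**2), 2.0 * X.T @ r / len(y)

rho, lr = 0.05, 0.1                    # perturbation radius and learning rate
for _ in range(200):
    # 1) ascent step: move to the (approximately) worst-case weights
    #    within an l2 ball of radius rho around w
    _, g = loss_and_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # 2) descent step: apply the gradient evaluated at the perturbed weights
    _, g_sharp = loss_and_grad(w + eps)
    w -= lr * g_sharp

print("final training loss:", loss_and_grad(w)[0])
```

Each iteration costs two gradient evaluations instead of one, which is exactly the overhead that the SAF paper listed above aims to remove.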
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.