Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep
Network Losses
- URL: http://arxiv.org/abs/2003.10397v1
- Date: Mon, 23 Mar 2020 17:16:19 GMT
- Title: Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep
Network Losses
- Authors: Charles G. Frye, James Simon, Neha S. Wadia, Andrew Ligeralde, Michael
R. DeWeese, Kristofer E. Bouchard
- Abstract summary: Gradient-based optimization algorithms converge to approximately the same performance from many random initial points.
We show that the methods used to find putative critical points suffer from a bad local minima problem of their own.
- Score: 2.046307988932347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the fact that the loss functions of deep neural networks are highly
non-convex, gradient-based optimization algorithms converge to approximately
the same performance from many random initial points. One thread of work has
focused on explaining this phenomenon by characterizing the local curvature
near critical points of the loss function, where the gradients are near zero,
and demonstrating that neural network losses enjoy a no-bad-local-minima
property and an abundance of saddle points. We report here that the methods
used to find these putative critical points suffer from a bad local minima
problem of their own: they often converge to or pass through regions where the
gradient norm has a stationary point. We call these gradient-flat regions,
since they arise when the gradient is approximately in the kernel of the
Hessian, such that the loss is locally approximately linear, or flat, in the
direction of the gradient. We describe how the presence of these regions
necessitates care in both interpreting past results that claimed to find
critical points of neural network losses and in designing second-order methods
for optimizing neural networks.
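To make the gradient-flat picture concrete, here is a minimal toy sketch (not the paper's code). It uses a hypothetical two-variable loss L(x, y) = x + y^2, whose gradient (1, 2y) lies in the kernel of the Hessian [[0, 0], [0, 2]] along y = 0, and runs gradient descent on the squared gradient norm, a standard way putative critical points are sought. The run stalls where the gradient norm is stationary but nonzero, i.e., in a gradient-flat region.

```python
# Toy sketch of a gradient-flat region; illustrative only, assuming the
# hypothetical loss L(x, y) = x + y**2 (not a neural network loss).
import jax
import jax.numpy as jnp

def loss(w):
    # Chosen so that grad L = (1, 2y) lies in ker(Hessian) whenever y = 0.
    x, y = w
    return x + y ** 2

grad_loss = jax.grad(loss)

def squared_grad_norm(w):
    g = grad_loss(w)
    return jnp.dot(g, g)

# Gradient descent on ||grad L||^2: a common way to search for critical points.
w = jnp.array([0.5, 1.0])
step = 0.05
for _ in range(200):
    w = w - step * jax.grad(squared_grad_norm)(w)

g = grad_loss(w)
H = jax.hessian(loss)(w)
print("w            =", w)                       # y -> 0, x barely moves
print("||grad L||   =", jnp.linalg.norm(g))      # stays near 1: not a critical point
print("||H grad L|| =", jnp.linalg.norm(H @ g))  # near 0: gradient in ker(H)
```

At the stalling point the gradient norm is roughly 1 while the Hessian annihilates the gradient, which is the signature of a gradient-flat region: the gradient-norm objective is stationary even though no critical point of the loss has been found.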
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Global Convergence Analysis of Deep Linear Networks with A One-neuron
Layer [18.06634056613645]
We consider optimizing deep linear networks which have a layer with one neuron under quadratic loss.
We describe the convergent point of trajectories with arbitrary starting point under flow.
We show specific convergence rates of trajectories that converge to the global minimizer by stages.
arXiv Detail & Related papers (2022-01-08T04:44:59Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Convergence of gradient descent for learning linear neural networks [2.209921757303168]
We show that gradient descent converges to a critical point of the loss function, i.e., the square loss in this article.
In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank.
arXiv Detail & Related papers (2021-08-04T13:10:30Z) - The loss landscape of deep linear neural networks: a second-order analysis [9.85879905918703]
We study the optimization landscape of deep linear neural networks with the square loss.
We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points.
arXiv Detail & Related papers (2021-07-28T11:33:18Z) - The layer-wise L1 Loss Landscape of Neural Nets is more complex around
local minima [3.04585143845864]
We use the Deep ReLU Simplex algorithm to minimize the loss monotonically on adjacent vertices.
In a neighbourhood around a local minimum, the iterations behave differently such that conclusions on loss level and proximity of the local minimum can be made before it has been found.
This could have far-reaching consequences for the design of new gradient-descent algorithms.
arXiv Detail & Related papers (2021-05-06T17:18:44Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - Second-Order Guarantees in Centralized, Federated and Decentralized
Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying nonconvex optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms to escape saddle points.
arXiv Detail & Related papers (2020-03-31T16:54:22Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close to zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the Gauss-Newton component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.