The layer-wise L1 Loss Landscape of Neural Nets is more complex around local minima
- URL: http://arxiv.org/abs/2105.02831v1
- Date: Thu, 6 May 2021 17:18:44 GMT
- Title: The layer-wise L1 Loss Landscape of Neural Nets is more complex around local minima
- Authors: Peter Hinz
- Abstract summary: We use the Deep ReLU Simplex algorithm to minimize the loss monotonically on adjacent vertices.
In a neighbourhood around a local minimum, the iterations behave differently such that conclusions on loss level and proximity of the local minimum can be made before it has been found.
This could have far-reaching consequences for the design of new gradient-descent algorithms.
- Score: 3.04585143845864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For fixed training data and network parameters in the other layers the L1
loss of a ReLU neural network as a function of the first layer's parameters is
a piece-wise affine function. We use the Deep ReLU Simplex algorithm to
iteratively minimize the loss monotonically on adjacent vertices and analyze
the trajectory of these vertex positions. We empirically observe that in a
neighbourhood around a local minimum, the iterations behave differently such
that conclusions on loss level and proximity of the local minimum can be made
before it has been found: Firstly the loss seems to decay exponentially slowly at
iterated adjacent vertices such that the loss level at the local minimum can be
estimated from the loss levels of subsequently iterated vertices, and secondly
we observe a strong increase of the vertex density around local minima. This
could have far-reaching consequences for the design of new gradient-descent
algorithms that might improve convergence rate by exploiting these facts.
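As a concrete illustration of the extrapolation idea, the sketch below assumes the vertex losses follow an approximately geometric decay $L_t \approx L^* + c \cdot r^t$ towards the local minimum's loss level $L^*$, and estimates $L^*$ from three consecutive vertex losses via Aitken's delta-squared formula. The estimator and the function name are illustrative choices made here, not the paper's method.

```python
# Minimal sketch (assumption, not the paper's code): if the losses observed at
# successively iterated adjacent vertices decay roughly geometrically toward
# the local minimum's loss level L*, i.e. L_t ~ L* + c * r**t with 0 < r < 1,
# then L* can be extrapolated from three consecutive values before the
# minimum itself is reached (Aitken's delta-squared formula).

def estimate_local_min_loss(losses):
    """Estimate the limiting loss L* from the last three vertex losses."""
    if len(losses) < 3:
        raise ValueError("need at least three loss values")
    l0, l1, l2 = losses[-3], losses[-2], losses[-1]
    denom = l0 + l2 - 2.0 * l1
    if abs(denom) < 1e-12:  # decay not (yet) close enough to geometric
        return l2
    return (l0 * l2 - l1 * l1) / denom


if __name__ == "__main__":
    # Synthetic loss trajectory: L_t = 0.5 + 2.0 * 0.8**t
    losses = [0.5 + 2.0 * 0.8 ** t for t in range(10)]
    print(estimate_local_min_loss(losses))  # ~0.5, the true limiting loss
```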
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time [45.72323731094864]
In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations.
Our study sheds new light on understanding why local methods work well.
arXiv Detail & Related papers (2024-02-06T01:29:35Z) - Training a Two Layer ReLU Network Analytically [4.94950858749529]
We will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss.
The method is faster than the gradient descent methods and has virtually no tuning parameters.
arXiv Detail & Related papers (2023-04-06T09:57:52Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Convergence of gradient descent for learning linear neural networks [2.209921757303168]
We show that gradient descent converges to a critical point of the loss function, i.e., the square loss in this article.
In the case of three or more layers we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank.
arXiv Detail & Related papers (2021-08-04T13:10:30Z) - LocoProp: Enhancing BackProp via Local Loss Optimization [27.93980177594535]
We study a local loss construction approach for optimizing neural networks.
We show that our construction consistently improves convergence, reducing the gap between first-order and second-order methods.
arXiv Detail & Related papers (2021-06-11T07:00:02Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - When does gradient descent with logistic loss find interpolating
two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z) - Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying nonconvex optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms to escape undesirable stationary points.
arXiv Detail & Related papers (2020-03-31T16:54:22Z) - Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses [2.046307988932347]
Gradient-based algorithms converge to approximately the same performance from random initial points.
We show that the methods used to find putative critical points suffer from a bad minima problem of their own.
arXiv Detail & Related papers (2020-03-23T17:16:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.