The layer-wise L1 Loss Landscape of Neural Nets is more complex around local minima
- URL: http://arxiv.org/abs/2105.02831v1
- Date: Thu, 6 May 2021 17:18:44 GMT
- Title: The layer-wise L1 Loss Landscape of Neural Nets is more complex around local minima
- Authors: Peter Hinz
- Abstract summary: We use the Deep ReLU Simplex algorithm to minimize the loss monotonically on adjacent vertices.
In a neighbourhood around a local minimum, the iterations behave differently such that conclusions on loss level and proximity of the local minimum can be made before it has been found.
This could have far-reaching consequences for the design of new gradient-descent algorithms.
- Score: 3.04585143845864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For fixed training data and network parameters in the other layers the L1
loss of a ReLU neural network as a function of the first layer's parameters is
a piece-wise affine function. We use the Deep ReLU Simplex algorithm to
iteratively minimize the loss monotonically on adjacent vertices and analyze
the trajectory of these vertex positions. We empirically observe that in a
neighbourhood around a local minimum, the iterations behave differently such
that conclusions on loss level and proximity of the local minimum can be made
before it has been found: Firstly the loss seems to decay exponentially slowly at
iterated adjacent vertices such that the loss level at the local minimum can be
estimated from the loss levels of subsequently iterated vertices, and secondly
we observe a strong increase of the vertex density around local minima. This
could have far-reaching consequences for the design of new gradient-descent
algorithms that might improve convergence rate by exploiting these facts.
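As a concrete illustration of the extrapolation idea, the sketch below assumes the vertex losses follow an approximately geometric decay $L_t \approx L^* + c \cdot r^t$ towards the local minimum's loss level $L^*$, and estimates $L^*$ from three consecutive vertex losses via Aitken's delta-squared formula. The estimator and the function name are illustrative choices made here, not the paper's method.

```python
# Minimal sketch (assumption, not the paper's code): if the losses observed at
# successively iterated adjacent vertices decay roughly geometrically toward
# the local minimum's loss level L*, i.e. L_t ~ L* + c * r**t with 0 < r < 1,
# then L* can be extrapolated from three consecutive values before the
# minimum itself is reached (Aitken's delta-squared formula).

def estimate_local_min_loss(losses):
    """Estimate the limiting loss L* from the last three vertex losses."""
    if len(losses) < 3:
        raise ValueError("need at least three loss values")
    l0, l1, l2 = losses[-3], losses[-2], losses[-1]
    denom = l0 + l2 - 2.0 * l1
    if abs(denom) < 1e-12:  # decay not (yet) close enough to geometric
        return l2
    return (l0 * l2 - l1 * l1) / denom


if __name__ == "__main__":
    # Synthetic loss trajectory: L_t = 0.5 + 2.0 * 0.8**t
    losses = [0.5 + 2.0 * 0.8 ** t for t in range(10)]
    print(estimate_local_min_loss(losses))  # ~0.5, the true limiting loss
```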
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time [45.72323731094864]
In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations.
Our study sheds new light on understanding why local methods work well.
arXiv Detail & Related papers (2024-02-06T01:29:35Z) - Training a Two Layer ReLU Network Analytically [4.94950858749529]
We will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss.
The method is faster than the gradient descent methods and has virtually no tuning parameters.
arXiv Detail & Related papers (2023-04-06T09:57:52Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Convergence of gradient descent for learning linear neural networks [2.209921757303168]
We show that gradient descent converges to a critical point of the loss function, i.e., the square loss in this article.
In the case of three or more layers we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank.
arXiv Detail & Related papers (2021-08-04T13:10:30Z) - LocoProp: Enhancing BackProp via Local Loss Optimization [27.93980177594535]
We study a local loss construction approach for optimizing neural networks.
We show that our construction consistently improves convergence, reducing the gap between first-order and second-order methods.
arXiv Detail & Related papers (2021-06-11T07:00:02Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - When does gradient descent with logistic loss find interpolating
two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z) - Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying nonconvex optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms to escape undesirable stationary points.
arXiv Detail & Related papers (2020-03-31T16:54:22Z) - Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses [2.046307988932347]
Gradient-based algorithms converge to approximately the same performance from random initial points.
We show that the methods used to find putative critical points suffer from a bad minima problem of their own.
arXiv Detail & Related papers (2020-03-23T17:16:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.