Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escaping, and Network Embedding
- URL: http://arxiv.org/abs/2402.05626v4
- Date: Tue, 11 Jun 2024 19:08:58 GMT
- Title: Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escaping, and Network Embedding
- Authors: Zhengqing Wu, Berfin Simsek, Francois Ged
- Abstract summary: We investigate the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained with the empirical squared loss.
As the activation function is non-differentiable, it is so far unclear how to completely characterize the stationary points.
We show that, if a stationary point does not contain "escape neurons", which are defined with first-order conditions, then it must be a local minimum.
- Score: 1.4513150969598634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained with the empirical squared loss. As the activation function is non-differentiable, it is so far unclear how to completely characterize the stationary points. We propose the conditions for stationarity that apply to both non-differentiable and differentiable cases. Additionally, we show that, if a stationary point does not contain "escape neurons", which are defined with first-order conditions, then it must be a local minimum. Moreover, for the scalar-output case, the presence of an escape neuron guarantees that the stationary point is not a local minimum. Our results refine the description of the saddle-to-saddle training process starting from infinitesimally small (vanishing) initialization for shallow ReLU-like networks, linking saddle escaping directly with the parameter changes of escape neurons. Moreover, we are also able to fully discuss how network embedding, which is to instantiate a narrower network within a wider network, reshapes the stationary points.
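The abstract's notion of network embedding, instantiating a narrower network within a wider one, can be pictured with a minimal numerical sketch. The construction below (duplicating a hidden neuron and splitting its outgoing weight across the copies) is one standard, function-preserving embedding chosen here for illustration; it is not claimed to be the paper's exact mapping, and all function names and constants are made up for this sketch.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    """ReLU-like activation: ReLU for alpha=0, leaky ReLU otherwise."""
    return np.where(z > 0, z, alpha * z)

def forward(X, W, a, alpha=0.1):
    """One-hidden-layer network with scalar output: f(x) = a^T sigma(W x)."""
    return leaky_relu(X @ W.T, alpha) @ a

def squared_loss(X, y, W, a, alpha=0.1):
    """Empirical squared loss, the training objective in the paper's setting."""
    r = forward(X, W, a, alpha) - y
    return 0.5 * np.mean(r ** 2)

def embed_by_splitting(W, a, j, t=0.5):
    """Embed an m-neuron network into an (m+1)-neuron one by duplicating
    neuron j and splitting its outgoing weight into t*a_j and (1-t)*a_j.
    Both copies compute the same hidden activation, so the network
    function, and hence the loss, is unchanged."""
    W_wide = np.vstack([W, W[j:j + 1]])
    a_wide = np.concatenate([a, [0.0]])
    a_wide[j], a_wide[-1] = t * a[j], (1.0 - t) * a[j]
    return W_wide, a_wide

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))          # inputs
y = rng.normal(size=64)               # targets
W = rng.normal(size=(3, 5)) * 1e-2    # small (near-vanishing) initialization
a = rng.normal(size=3) * 1e-2

W2, a2 = embed_by_splitting(W, a, j=0)
print(squared_loss(X, y, W, a), squared_loss(X, y, W2, a2))  # identical losses
```

The equal losses at corresponding parameters are the starting point for the question the abstract raises: how such an embedding transports stationary points of the narrow landscape into the wider one, and whether their nature (local minimum versus saddle) is preserved.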
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Gradient descent provably escapes saddle points in the training of shallow ReLU networks [6.458742319938318]
We prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements.
Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks, we show that gradient descent escapes most saddle points.
arXiv Detail & Related papers (2022-08-03T14:08:52Z) - Semi-signed neural fitting for surface reconstruction from unoriented
point clouds [53.379712818791894]
We propose SSN-Fitting to reconstruct a better signed distance field.
SSN-Fitting consists of a semi-signed supervision and a loss-based region sampling strategy.
We conduct experiments to demonstrate that SSN-Fitting achieves state-of-the-art performance under different settings.
arXiv Detail & Related papers (2022-06-14T09:40:17Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On the Omnipresence of Spurious Local Minima in Certain Neural Network
Training Problems [0.0]
We study the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output.
It is shown that such problems possess a continuum of spurious (i.e., not globally optimal) local minima for all target functions that are not affine.
arXiv Detail & Related papers (2022-02-23T14:41:54Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - The layer-wise L1 Loss Landscape of Neural Nets is more complex around
local minima [3.04585143845864]
We use the Deep ReLU Simplex algorithm to minimize the loss monotonically on adjacent vertices.
In a neighbourhood of a local minimum, the iterates behave differently, so that conclusions about the loss level and the proximity of the local minimum can be drawn before the minimum itself has been found.
This could have far-reaching consequences for the design of new gradient-descent algorithms.
arXiv Detail & Related papers (2021-05-06T17:18:44Z) - On Connectivity of Solutions in Deep Learning: The Role of
Over-parameterization and Feature Quality [21.13299067136635]
We present a novel condition for ensuring the connectivity of two arbitrary points in parameter space.
This condition is provably milder than dropout stability, and it provides a connection between the problem of finding low-loss paths and the memorization capacity of neural nets.
arXiv Detail & Related papers (2021-02-18T23:44:08Z) - When does gradient descent with logistic loss find interpolating
two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z) - No one-hidden-layer neural network can represent multivariable functions [0.0]
In a function approximation with a neural network, an input dataset is mapped to an output index by optimizing the parameters of each hidden-layer unit.
We present constraints on the parameters and its second derivative by constructing a continuum version of a one-hidden-layer neural network with the rectified linear unit (ReLU) activation function.
arXiv Detail & Related papers (2020-06-19T06:46:54Z) - GRNet: Gridding Residual Network for Dense Point Cloud Completion [54.43648460932248]
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications.
We propose a novel Gridding Residual Network (GRNet) for point cloud completion.
Experimental results indicate that the proposed GRNet performs favorably against state-of-the-art methods on the ShapeNet, Completion3D, and KITTI benchmarks.
arXiv Detail & Related papers (2020-06-06T02:46:39Z) - Piecewise linear activations substantially shape the loss surfaces of
neural networks [95.73230376153872]
This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, which are defined as local minima with higher empirical risk than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z) - Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep
Network Losses [2.046307988932347]
Gradient-based algorithms converge to approximately the same performance from random initial points.
We show that the methods used to find putative critical points suffer from a bad minima problem of their own.
arXiv Detail & Related papers (2020-03-23T17:16:19Z) - Ill-Posedness and Optimization Geometry for Nonlinear Neural Network
Training [4.7210697296108926]
We show that the nonlinear activation functions used in the network construction play a critical role in classifying stationary points of the loss landscape.
For shallow dense networks, the nonlinear activation function determines the Hessian nullspace in the vicinity of global minima.
We extend these results to deep dense neural networks, showing that the last activation function plays an important role in classifying stationary points.
arXiv Detail & Related papers (2020-02-07T16:33:34Z) - How Implicit Regularization of ReLU Neural Networks Characterizes the
Learned Function -- Part I: the 1-D Case of Two Layers with Random First
Layer [5.969858080492586]
We consider one dimensional (shallow) ReLU neural networks in which weights are chosen randomly and only the terminal layer is trained.
We show that for such networks L2-regularized regression corresponds in function space to regularizing the estimate's second derivative for fairly general loss functionals.
arXiv Detail & Related papers (2019-11-07T13:48:15Z)
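The setting of the last entry, a 1-D ReLU network whose first layer is random and fixed while only the terminal layer is trained with an L2 penalty, amounts to ridge regression on random ReLU features. The sketch below only reproduces that training setup for illustration (all names, the target function, and the regularization constant are chosen here); the paper's result that the L2 penalty acts like a second-derivative penalty in function space is a theoretical statement not demonstrated by this code.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1-D random-features ReLU network: random, fixed first layer; trainable output layer.
n_features, n_samples, lam = 200, 50, 1e-3
w = rng.normal(size=n_features)              # random input weights (fixed)
b = rng.uniform(-1.0, 1.0, size=n_features)  # random biases (fixed)

x = np.linspace(-1.0, 1.0, n_samples)
y = np.sin(3.0 * x)                          # an arbitrary 1-D regression target

# Feature map: Phi[i, k] = ReLU(w_k * x_i + b_k); only the output weights c are trained.
Phi = np.maximum(w[None, :] * x[:, None] + b[None, :], 0.0)

# L2-regularized least squares on the terminal layer (ridge regression in feature space).
c = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)
y_hat = Phi @ c
print("train MSE:", np.mean((y_hat - y) ** 2))
```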
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.