Ill-Posedness and Optimization Geometry for Nonlinear Neural Network Training
- URL: http://arxiv.org/abs/2002.02882v1
- Date: Fri, 7 Feb 2020 16:33:34 GMT
- Title: Ill-Posedness and Optimization Geometry for Nonlinear Neural Network Training
- Authors: Thomas O'Leary-Roseberry, Omar Ghattas
- Abstract summary: We show that the nonlinear activation functions used in the network construction play a critical role in classifying stationary points of the loss landscape.
For shallow dense networks, the nonlinear activation function determines the Hessian nullspace in the vicinity of global minima.
We extend these results to deep dense neural networks, showing that the last activation function plays an important role in classifying stationary points.
- Score: 4.7210697296108926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we analyze the role nonlinear activation functions play at
stationary points of dense neural network training problems. We consider a
generic least squares loss function training formulation. We show that the
nonlinear activation functions used in the network construction play a critical
role in classifying stationary points of the loss landscape. We show that for
shallow dense networks, the nonlinear activation function determines the
Hessian nullspace in the vicinity of global minima (if they exist), and
therefore determines the ill-posedness of the training problem. Furthermore,
for shallow nonlinear networks we show that the zeros of the activation
function and its derivatives can lead to spurious local minima, and discuss
conditions for strict saddle points. We extend these results to deep dense
neural networks, showing that the last activation function plays an important
role in classifying stationary points, due to how it enters the gradient
through the chain rule.
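As a concrete illustration of the setting described above, here is a minimal NumPy sketch (illustrative notation and variable names, not the paper's) of a shallow dense network f(x) = W2 σ(W1 x) with a least-squares loss; the derivative σ' enters the gradient of the inner weights through the chain rule, which is the mechanism behind the role the activation plays at stationary points.

```python
import numpy as np

# Minimal sketch (illustrative notation, not the paper's): a shallow dense
# network f(x) = W2 @ sigma(W1 @ x) trained with the least-squares loss
#   L(W1, W2) = 0.5 * sum_i || f(x_i) - y_i ||^2.
# The chain rule places sigma'(W1 @ x) inside dL/dW1, which is why zeros of
# the activation and of its derivative matter for classifying stationary points.

def sigma(z):
    return np.tanh(z)                 # example smooth activation

def dsigma(z):
    return 1.0 - np.tanh(z) ** 2      # its derivative

def loss_and_grads(W1, W2, X, Y):
    """W1: (h, d), W2: (m, h), X: (d, n) inputs, Y: (m, n) targets."""
    Z = W1 @ X                        # pre-activations, (h, n)
    A = sigma(Z)                      # hidden activations, (h, n)
    R = W2 @ A - Y                    # residuals, (m, n)
    loss = 0.5 * np.sum(R ** 2)
    grad_W2 = R @ A.T                 # dL/dW2 = R A^T
    grad_W1 = ((W2.T @ R) * dsigma(Z)) @ X.T   # sigma' enters via the chain rule
    return loss, grad_W1, grad_W2
```

In this sketch, grad_W1 vanishes wherever the backpropagated residual W2.T @ R or dsigma(Z) is zero, so an activation whose derivative has zeros can create stationary points that are not global minima; near a global minimum, where R is close to zero, the remaining Gauss-Newton curvature is built from sigma(Z) and dsigma(Z), which is why the activation determines the Hessian nullspace there.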
Related papers
- Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks.
We show that the networks acquire strong, data-dependent features.
Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z)
- Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escaping, and Network Embedding [1.4513150969598634]
We investigate the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained with the empirical squared loss.
Because the activation function is non-differentiable, it has so far been unclear how to completely characterize the stationary points.
We show that, if a stationary point does not contain "escape neurons", which are defined with first-order conditions, then it must be a local minimum.
arXiv Detail & Related papers (2024-02-08T12:30:29Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
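For the threshold-activation problem summarized in the previous entry, the sketch below only writes down the weight-decay regularized squared-loss objective with a Heaviside activation (an assumed two-layer instance; it does not reproduce the paper's convex reformulation):

```python
import numpy as np

def heaviside(z):
    """Threshold activation: 1 where the pre-activation is positive, else 0."""
    return (z > 0).astype(float)

def regularized_objective(W1, w2, X, y, lam):
    """Weight-decay regularized squared loss of a two-layer threshold network.
    X: (n, d) data, y: (n,) targets, W1: (d, h), w2: (h,), lam: decay strength.
    NOTE: an illustrative instance of the problem class, not the paper's
    convex reformulation."""
    H = heaviside(X @ W1)                 # hidden layer, piecewise constant in W1
    residual = H @ w2 - y
    data_fit = 0.5 * np.sum(residual ** 2)
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(w2 ** 2))
    return data_fit + decay
```

Because the threshold activation has zero derivative almost everywhere, plain gradient descent gives no signal to W1; that is what makes a convex reformulation, valid when the data can be shattered at some layer, attractive.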
- Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think [0.0]
We investigate how much we can simplify the network function towards linearity before performance collapses.
We find that after training, we are able to linearize a significant number of nonlinear units while maintaining high performance.
Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core-networks of near constant effective depth and width.
arXiv Detail & Related papers (2022-11-30T17:24:14Z)
- Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Because samples have complex non-linear characteristics, the purpose of these activation functions is to project samples from their original feature space into a linearly separable feature space.
This motivates us to explore whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
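The question raised in the previous entry, whether every feature must pass through a non-linearity, can be pictured with a hypothetical layer that applies ReLU to only part of its output and leaves the rest linear (an illustration of the idea, not the paper's architecture):

```python
import numpy as np

def partially_nonlinear_layer(x, W, b, n_linear):
    """Hypothetical layer: the first n_linear output features stay linear,
    the remaining ones pass through ReLU."""
    z = x @ W + b
    out = z.copy()
    out[..., n_linear:] = np.maximum(z[..., n_linear:], 0.0)
    return out

# Example: 8 output features, 3 of them kept purely linear
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))     # batch of 4 samples, input dimension 5
W = rng.normal(size=(5, 8))
b = np.zeros(8)
h = partially_nonlinear_layer(x, W, b, n_linear=3)
print(h.shape)                  # (4, 8)
```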
- On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems [0.0]
We study the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output.
It is shown that such problems possess a continuum of spurious (i.e., not globally optimal) local minima for all target functions that are not affine.
arXiv Detail & Related papers (2022-02-23T14:41:54Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
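For a scalar-input two-layer ReLU network f(x) = sum_j v_j ReLU(w_j x + b_j) + c, as in the previous entry, the function is piecewise linear with knots where a pre-activation changes sign, i.e. at x = -b_j / w_j for w_j != 0; the sketch below simply enumerates those locations (an elementary illustration, not the paper's mean-field analysis), and nothing forces them to coincide with the data points.

```python
import numpy as np

def knot_locations(w, b, tol=1e-12):
    """Knots of f(x) = sum_j v_j * relu(w_j * x + b_j) + c: the points where
    some pre-activation w_j * x + b_j crosses zero, i.e. x = -b_j / w_j."""
    w = np.asarray(w, dtype=float)
    b = np.asarray(b, dtype=float)
    active = np.abs(w) > tol          # neurons with w_j == 0 contribute no knot
    return np.sort(-b[active] / w[active])

# Example: three hidden neurons give up to three knots on the real line
print(knot_locations(w=[1.0, -2.0, 0.5], b=[0.3, 1.0, -0.2]))  # [-0.3  0.4  0.5]
```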
- Activation function design for deep networks: linearity and effective initialisation [10.108857371774977]
We study how to avoid two problems at initialisation identified in prior works.
We prove that both these problems can be avoided by choosing an activation function possessing a sufficiently large linear region around the origin.
arXiv Detail & Related papers (2021-05-17T11:30:46Z)
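One way to read "a sufficiently large linear region around the origin" from the previous entry is to measure how far an activation σ stays close to its linearization σ(0) + σ'(0)x; the sketch below does this numerically for tanh (a rough illustration, not the paper's formal criterion):

```python
import numpy as np

def linearization_error(sigma, dsigma, radius, n=1001):
    """Maximum deviation of sigma from its linearization sigma(0) + sigma'(0)*x
    over [-radius, radius]; small values mean the activation behaves almost
    linearly on that region."""
    x = np.linspace(-radius, radius, n)
    linear = sigma(0.0) + dsigma(0.0) * x
    return np.max(np.abs(sigma(x) - linear))

def dtanh(z):
    return 1.0 - np.tanh(z) ** 2

for r in (0.1, 0.5, 1.0):
    print(r, linearization_error(np.tanh, dtanh, r))
```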
- Piecewise linear activations substantially shape the loss surfaces of neural networks [95.73230376153872]
This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, defined as local minima with higher empirical risk than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z)
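A common reading of the "cells" mentioned in the previous entry (my paraphrase, not the paper's formal definition) is: regions of parameter space over which the ReLU activation pattern on the training data stays fixed. The sketch below records that pattern for a one-hidden-layer network:

```python
import numpy as np

def activation_pattern(W, b, X):
    """Boolean matrix whose (j, i) entry is True iff hidden ReLU neuron j is
    active (positive pre-activation) on sample i. Parameter settings that share
    this pattern lie in the same region ('cell', in one common reading)."""
    return (W @ X + b[:, None]) > 0

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 10))    # 3 input features, 10 samples
W = rng.normal(size=(4, 3))     # 4 hidden ReLU neurons
b = rng.normal(size=4)
print(activation_pattern(W, b, X).astype(int))
```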
- Deep Neural Networks with Trainable Activations and Controlled Lipschitz Constant [26.22495169129119]
We introduce a variational framework to learn the activation functions of deep neural networks.
Our aim is to increase the capacity of the network while controlling an upper-bound of the Lipschitz constant.
We numerically compare our scheme with the standard ReLU network and its variants, PReLU and LeakyReLU.
arXiv Detail & Related papers (2020-01-17T12:32:55Z)
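The quantity controlled in the last entry can be illustrated with the standard product bound: the Lipschitz constant of a feed-forward network is at most the product of the layers' spectral norms and the activations' maximal slopes. The sketch below computes that loose bound for a small PReLU-style network; it is an illustration of the controlled quantity, not the paper's variational framework.

```python
import numpy as np

def lipschitz_upper_bound(weights, activation_slopes):
    """Standard product bound: Lip(f) <= prod_l ||W_l||_2 * prod_l L_l, where
    L_l is the maximal absolute slope of the l-th activation. Loose, but it is
    the kind of upper bound one can cap while training."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, ord=2)   # spectral norm (largest singular value)
    for L in activation_slopes:
        bound *= L
    return bound

# Example: two dense layers with one PReLU-like activation of slope a in between
rng = np.random.default_rng(2)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(1, 16))
a = 0.25                                    # PReLU negative-side slope
print(lipschitz_upper_bound([W1, W2], [max(1.0, abs(a))]))
```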