Piecewise linear activations substantially shape the loss surfaces of
neural networks
- URL: http://arxiv.org/abs/2003.12236v1
- Date: Fri, 27 Mar 2020 04:59:34 GMT
- Title: Piecewise linear activations substantially shape the loss surfaces of
neural networks
- Authors: Fengxiang He, Bohan Wang, Dacheng Tao
- Abstract summary: This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinite spurious local minima, which are defined as local minima with higher empirical risks than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
- Score: 95.73230376153872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the loss surface of a neural network is fundamentally important
to the understanding of deep learning. This paper presents how piecewise linear
activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinite
spurious local minima, which are defined as the local minima with higher
empirical risks than the global minima. Our result demonstrates that the
networks with piecewise linear activations possess substantial differences to
the well-studied linear neural networks. This result holds for any neural
network with arbitrary depth and arbitrary piecewise linear activation
functions (excluding linear functions) under most loss functions in practice.
Essentially, the underlying assumptions are consistent with most practical
circumstances where the output layer is narrower than any hidden layer. In
addition, the loss surface of a neural network with piecewise linear
activations is partitioned into multiple smooth and multilinear cells by
nondifferentiable boundaries. The constructed spurious local minima are
concentrated in one cell as a valley: they are connected with each other by a
continuous path, on which empirical risk is invariant. Further for
one-hidden-layer networks, we prove that all local minima in a cell constitute
an equivalence class; they are concentrated in a valley; and they are all
global minima in the cell.
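To make the valley statement concrete, the following is a minimal NumPy sketch (an illustration on assumed toy data, not the paper's construction): for a one-hidden-layer ReLU network, positively rescaling one hidden unit's incoming weights while inversely rescaling its outgoing weight traces a continuous parameter path along which the empirical risk is constant and the activation pattern, which identifies the cell, never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a one-hidden-layer ReLU network y_hat = W2 @ relu(W1 @ x).
X = rng.normal(size=(5, 20))   # 5 input features, 20 samples (illustrative)
Y = rng.normal(size=(1, 20))   # 1-dimensional targets (illustrative)
W1 = rng.normal(size=(8, 5))   # hidden-layer weights
W2 = rng.normal(size=(1, 8))   # output-layer weights


def relu(z):
    return np.maximum(z, 0.0)


def risk(W1, W2):
    """Empirical risk (mean squared error) of the ReLU network."""
    return np.mean((Y - W2 @ relu(W1 @ X)) ** 2)


def pattern(W1):
    """Activation pattern: which hidden units fire on which samples.
    A fixed pattern corresponds to one smooth, multilinear cell."""
    return W1 @ X > 0


base_risk = risk(W1, W2)
base_pattern = pattern(W1)

# Continuous path: scale unit 0's incoming weights by alpha > 0 and its
# outgoing weight by 1/alpha. Since relu(alpha * z) = alpha * relu(z) for
# alpha > 0, the network function, and hence the empirical risk, is unchanged,
# and the pre-activation signs (the cell) are preserved.
for alpha in np.linspace(0.5, 2.0, 7):
    W1_a, W2_a = W1.copy(), W2.copy()
    W1_a[0, :] *= alpha
    W2_a[:, 0] /= alpha
    assert np.isclose(risk(W1_a, W2_a), base_risk)      # risk is invariant
    assert np.array_equal(pattern(W1_a), base_pattern)  # path stays in one cell

print("empirical risk along the whole path:", base_risk)
```

This rescaling invariance only illustrates why equal-risk continuous paths exist inside a cell; the infinitely many spurious local minima themselves come from the paper's construction, not from this sketch.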
Related papers
- Addressing caveats of neural persistence with deep graph persistence [54.424983583720675]
We find that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence.
We propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers.
This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues.
arXiv Detail & Related papers (2023-07-20T13:34:11Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Rank Diminishing in Deep Neural Networks [71.03777954670323]
Rank of neural networks measures information flowing across layers.
It is an instance of a key structural condition that applies across broad domains of machine learning.
For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear.
arXiv Detail & Related papers (2022-06-13T12:03:32Z)
- On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems [0.0]
We study the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output.
It is shown that such problems possess a continuum of spurious (i.e., not globally optimal) local minima for all target functions that are not affine.
arXiv Detail & Related papers (2022-02-23T14:41:54Z)
- Exact Solutions of a Deep Linear Network [2.2344764434954256]
This work finds the analytical expression of the global minima of a deep linear network with weight decay and stochastic neurons.
We show that weight decay strongly interacts with the model architecture and can create bad minima at zero in a network with more than $1$ hidden layer (a minimal numerical check appears after this list).
arXiv Detail & Related papers (2022-02-10T00:13:34Z)
- Spurious Local Minima Are Common for Deep Neural Networks with Piecewise Linear Activations [4.758120194113354]
We show that spurious local minima are common for deep fully-connected networks and CNNs with piecewise linear activation functions.
A motivating example is given to explain the reason for the existence of spurious local minima.
arXiv Detail & Related papers (2021-02-25T23:51:14Z)
- The Connection Between Approximation, Depth Separation and Learnability in Neural Networks [70.55686685872008]
We study the connection between learnability and approximation capacity.
We show that learnability with deep networks of a target function depends on the ability of simpler classes to approximate the target.
arXiv Detail & Related papers (2021-01-31T11:32:30Z)
- Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z)
- Over-parametrized neural networks as under-determined linear systems [31.69089186688224]
We show that it is unsurprising that simple neural networks can achieve zero training loss.
We show that kernels typically associated with the ReLU activation function have fundamental flaws.
We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points.
arXiv Detail & Related papers (2020-10-29T21:43:00Z)
- Avoiding Spurious Local Minima in Deep Quadratic Networks [0.0]
We characterize the landscape of the mean squared nonlinear error for networks with quadratic activation functions.
We prove that deep over-parameterized neural networks with quadratic activations benefit from similar landscape properties.
arXiv Detail & Related papers (2019-12-31T22:31:11Z)
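For the entry Exact Solutions of a Deep Linear Network above, here is a minimal numerical check of the bad minimum at zero. It is a sketch under assumed values rather than the paper's derivation: the single data point, the weight-decay strength `lam`, and the comparison point (1, 1, 1) are illustrative choices. With two hidden layers, i.e. a product of three scalar weights, plus weight decay, small random perturbations of the all-zero parameters never lower the regularized loss, even though a strictly better point exists, so zero is a spurious local minimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar deep linear network with two hidden layers: f(x) = w3 * w2 * w1 * x,
# trained with weight decay (illustrative data point and regularization strength).
x, y = 1.0, 1.0
lam = 0.1


def loss(w):
    w1, w2, w3 = w
    return (y - w3 * w2 * w1 * x) ** 2 + lam * (w1**2 + w2**2 + w3**2)


origin = np.zeros(3)

# Around zero the data-fit term changes only at third order in the perturbation,
# while the weight-decay term grows at second order, so no small perturbation
# can decrease the loss: zero is a strict local minimum.
perturbations = 1e-3 * rng.normal(size=(10_000, 3))
assert all(loss(origin + dw) > loss(origin) for dw in perturbations)

# Yet zero is spurious: w = (1, 1, 1) fits the data exactly and has lower loss.
print("loss at zero:", loss(origin), "loss at (1, 1, 1):", loss(np.ones(3)))
```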
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.