Spurious Local Minima Are Common for Deep Neural Networks with Piecewise
Linear Activations
- URL: http://arxiv.org/abs/2102.13233v1
- Date: Thu, 25 Feb 2021 23:51:14 GMT
- Title: Spurious Local Minima Are Common for Deep Neural Networks with Piecewise
Linear Activations
- Authors: Bo Liu
- Abstract summary: Spurious local minima are common for deep fully-connected networks and CNNs with piecewise linear activation functions.
A motivating example is given to explain the reason for the existence of spurious local minima.
- Score: 4.758120194113354
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, it is shown theoretically that spurious local minima are
common for deep fully-connected networks and convolutional neural networks
(CNNs) with piecewise linear activation functions and datasets that cannot be
fitted by linear models. A motivating example is given to explain the reason
for the existence of spurious local minima: each output neuron of deep
fully-connected networks and CNNs with piecewise linear activations produces a
continuous piecewise linear (CPWL) output, and different pieces of CPWL output
can fit disjoint groups of data samples when minimizing the empirical risk.
Fitting data samples with different CPWL functions usually results in different
levels of empirical risk, leading to the prevalence of spurious local minima. This
result is proved in general settings with any continuous loss function. The
main proof technique is to represent a CPWL function as a maximization over
minimization of linear pieces. Deep ReLU networks are then constructed to
produce these linear pieces and implement maximization and minimization
operations.
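The proof technique rests on two standard facts: any CPWL function admits a max-min representation over its linear pieces, $f(x) = \max_j \min_{i \in S_j} (a_i^\top x + b_i)$, and the binary max and min can each be realized with a single ReLU via $\max(a,b) = b + \mathrm{relu}(a-b)$ and $\min(a,b) = a - \mathrm{relu}(a-b)$. Below is a minimal NumPy sketch of this idea; the tent-shaped CPWL target used there is a hypothetical example for illustration, not a construction taken from the paper.

```python
# Minimal sketch (assumption: the tent-shaped target f below is a hypothetical
# example, not the paper's construction) showing that a max-min combination of
# linear pieces can be built from ReLUs alone.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def max2(a, b):
    # max(a, b) expressed with a single ReLU: b + relu(a - b)
    return b + relu(a - b)

def min2(a, b):
    # min(a, b) expressed with a single ReLU: a - relu(a - b)
    return a - relu(a - b)

# Hypothetical CPWL target: f(x) = max(min(x, 2 - x), -1),
# a "tent" peaking at x = 1 and clipped below at -1.
def cpwl_via_relu(x):
    return max2(min2(x, 2.0 - x), -1.0 * np.ones_like(x))

x = np.linspace(-3.0, 5.0, 201)
direct = np.maximum(np.minimum(x, 2.0 - x), -1.0)
assert np.allclose(cpwl_via_relu(x), direct)  # ReLU construction reproduces f exactly
```

The same pattern nests: composing max2/min2 over affine maps of the input yields a deep ReLU network that outputs any prescribed max-min combination of linear pieces, which is the role such constructions play in the proof.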
Related papers
- Physics-Informed Neural Networks: Minimizing Residual Loss with Wide Networks and Effective Activations [5.731640425517324]
We show that under certain conditions, the residual loss of PINNs can be globally minimized by a wide neural network.
An activation function with well-behaved high-order derivatives plays a crucial role in minimizing the residual loss.
The established theory paves the way for designing and choosing effective activation functions for PINNs.
arXiv Detail & Related papers (2024-05-02T19:08:59Z) - Linear Mode Connectivity in Sparse Neural Networks [1.30536490219656]
We study how neural network pruning with synthetic data leads to sparse networks with unique training properties.
We find that these properties lead to sparse networks matching the performance of traditional IMP with up to 150x fewer training points in settings where distilled data applies.
arXiv Detail & Related papers (2023-10-28T17:51:39Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models [9.96121040675476]
This manuscript explores how properties of functions learned by neural networks of depth greater than two layers affect predictions.
Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs.
arXiv Detail & Related papers (2023-05-24T22:10:12Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On the Omnipresence of Spurious Local Minima in Certain Neural Network
Training Problems [0.0]
We study the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output.
It is shown that such problems possess a continuum of spurious (i.e., not globally optimal) local minima for all target functions that are not affine.
arXiv Detail & Related papers (2022-02-23T14:41:54Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z) - Piecewise linear activations substantially shape the loss surfaces of
neural networks [95.73230376153872]
This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, which are defined as local minima with higher empirical risk than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z)