On the Omnipresence of Spurious Local Minima in Certain Neural Network
Training Problems
- URL: http://arxiv.org/abs/2202.12262v2
- Date: Thu, 15 Jun 2023 11:32:47 GMT
- Title: On the Omnipresence of Spurious Local Minima in Certain Neural Network
Training Problems
- Authors: Constantin Christof and Julia Kowalczyk
- Abstract summary: We study the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output.
It is shown that such problems possess a continuum of spurious (i.e., not globally optimal) local minima for all target functions that are not affine.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the loss landscape of training problems for deep artificial neural
networks with a one-dimensional real output whose activation functions contain
an affine segment and whose hidden layers have width at least two. It is shown
that such problems possess a continuum of spurious (i.e., not globally optimal)
local minima for all target functions that are not affine. In contrast to
previous works, our analysis covers all sampling and parameterization regimes,
general differentiable loss functions, arbitrary continuous nonpolynomial
activation functions, and both the finite- and infinite-dimensional setting. It
is further shown that the appearance of the spurious local minima in the
considered training problems is a direct consequence of the universal
approximation theorem and that the underlying mechanisms also cause, e.g.,
$L^p$-best approximation problems to be ill-posed in the sense of Hadamard for
all networks that do not have a dense image. The latter result also holds
without the assumption of local affine linearity and without any conditions on
the hidden layers.
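The mechanism behind the abstract's main claim can be illustrated numerically. The following sketch (a toy example in NumPy, not the paper's formal construction) uses a width-two one-hidden-layer ReLU network: when every hidden unit operates on the affine segment of its activation, the network realizes an affine function, so the parameters realizing the best affine fit of a non-affine target (here $x^2$) form a critical point of the loss, even though a strictly better non-affine configuration exists.

```python
import numpy as np

# Toy illustration of the mechanism described in the abstract: a width-2
# one-hidden-layer ReLU network whose pre-activations are all positive on
# the sample range computes an affine function of the input.

x = np.linspace(-1.0, 1.0, 101)
y = x ** 2  # non-affine target

def loss(p):
    # p = [w1 (2), b1 (2), w2 (2), b2]: width-2 one-hidden-layer ReLU net.
    w1, b1, w2, b2 = p[0:2], p[2:4], p[4:6], p[6]
    h = np.maximum(np.outer(w1, x) + b1[:, None], 0.0)
    return np.mean((w2 @ h + b2 - y) ** 2)

# Pre-activations x + 2 > 0 on [-1, 1], so the net is affine there.  With
# zero output weights and b2 = mean(y), it realizes the least-squares
# affine fit of x^2 on the symmetric grid (slope 0, intercept mean(y));
# small parameter perturbations keep the net affine, so this is a local
# minimum of the loss.
affine = np.array([1.0, 1.0, 2.0, 2.0, 0.0, 0.0, y.mean()])

# A strictly better (non-affine) configuration: relu(x) + relu(-x) = |x|.
better = np.array([1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

# Central-difference gradient at the affine configuration.
eps = 1e-6
grad = np.array([(loss(affine + eps * e) - loss(affine - eps * e)) / (2 * eps)
                 for e in np.eye(7)])

print(np.abs(grad).max())          # ~0: a critical point
print(loss(affine), loss(better))  # better < affine: not globally optimal
```

The same construction goes through for any activation with an affine segment, which is why the paper's result covers such a broad class of networks: the spurious minimum arises from the function class, not from a particular activation.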
Related papers
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Neural Collapse with Normalized Features: A Geometric Analysis over the Riemannian Manifold [30.3185037354742]
When training normalized deep networks for classification tasks, the learned features exhibit a so-called "neural collapse" phenomenon.
We show that better representations can be learned faster via feature normalization.
arXiv Detail & Related papers (2022-09-19T17:26:32Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- An Unconstrained Layer-Peeled Perspective on Neural Collapse [20.75423143311858]
We introduce a surrogate model called the unconstrained layer-peeled model (ULPM).
We prove that gradient flow on this model converges to critical points of a minimum-norm separation problem exhibiting neural collapse in its global minimizer.
We show that our results also hold during the training of neural networks in real-world tasks when explicit regularization or weight decay is not used.
arXiv Detail & Related papers (2021-10-06T14:18:47Z)
- The loss landscape of deep linear neural networks: a second-order analysis [9.85879905918703]
We study the optimization landscape of deep linear neural networks with the square loss.
We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points.
arXiv Detail & Related papers (2021-07-28T11:33:18Z)
- Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z)
- Generalization bound of globally optimal non-convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks for neural network optimization analysis, such as mean field theory and neural tangent kernel theory, typically require taking the infinite-width limit of the network to show global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z)
- The Hidden Convex Optimization Landscape of Two-Layer ReLU Neural Networks: an Exact Characterization of the Optimal Solutions [51.60996023961886]
We prove that finding all globally optimal two-layer ReLU neural networks can be performed by solving a convex optimization program with cone constraints.
Our analysis is novel, characterizes all optimal solutions, and does not leverage duality-based analysis which was recently used to lift neural network training into convex spaces.
arXiv Detail & Related papers (2020-06-10T15:38:30Z)
- Piecewise linear activations substantially shape the loss surfaces of neural networks [95.73230376153872]
This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks.
We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, defined as local minima with higher empirical risks than the global minima.
For one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
arXiv Detail & Related papers (2020-03-27T04:59:34Z)
- Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
arXiv Detail & Related papers (2020-02-11T15:42:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.