On the Stability Properties and the Optimization Landscape of Training
Problems with Squared Loss for Neural Networks and General Nonlinear Conic
Approximation Schemes
- URL: http://arxiv.org/abs/2011.03293v3
- Date: Thu, 2 Dec 2021 10:47:52 GMT
- Title: On the Stability Properties and the Optimization Landscape of Training
Problems with Squared Loss for Neural Networks and General Nonlinear Conic
Approximation Schemes
- Authors: Constantin Christof
- Abstract summary: We study the optimization landscape and the stability properties of training problems with squared loss for neural networks and general nonlinear conic approximation schemes.
We prove that the same effects that are responsible for these instability properties are also the reason for the emergence of saddle points and spurious local minima.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the optimization landscape and the stability properties of training
problems with squared loss for neural networks and general nonlinear conic
approximation schemes. It is demonstrated that, if a nonlinear conic
approximation scheme is considered that is (in an appropriately defined sense)
more expressive than a classical linear approximation approach and if there
exist unrealizable label vectors, then a training problem with squared loss is
necessarily unstable in the sense that its solution set depends discontinuously
on the label vector in the training data. We further prove that the same
effects that are responsible for these instability properties are also the
reason for the emergence of saddle points and spurious local minima, which may
be arbitrarily far away from global solutions, and that neither the instability
of the training problem nor the existence of spurious local minima can, in
general, be overcome by adding a regularization term to the objective function
that penalizes the size of the parameters in the approximation scheme. The
latter results are shown to be true regardless of whether the assumption of
realizability is satisfied or not. We demonstrate that our analysis in
particular applies to training problems for free-knot interpolation schemes and
deep and shallow neural networks with variable widths that involve an arbitrary
mixture of various activation functions (e.g., binary, sigmoid, tanh, arctan,
soft-sign, ISRU, soft-clip, SQNL, ReLU, leaky ReLU, soft-plus, bent identity,
SILU, ISRLU, and ELU). In summary, the findings of this paper illustrate that
the improved approximation properties of neural networks and general nonlinear
conic approximation instruments are linked in a direct and quantifiable way to
undesirable properties of the optimization problems that have to be solved in
order to train them.
Related papers
- GLinSAT: The General Linear Satisfiability Neural Network Layer By Accelerated Gradient Descent [12.409030267572243]
We first reformulate the neural network output projection problem as an entropy-regularized linear programming problem.
Based on an accelerated gradient descent algorithm with numerical performance enhancement, we present our architecture, GLinSAT, to solve the problem.
This is the first general linear satisfiability layer in which all the operations are differentiable and matrix-factorization-free.
arXiv Detail & Related papers (2024-09-26T03:12:53Z) - FEM-based Neural Networks for Solving Incompressible Fluid Flows and Related Inverse Problems [41.94295877935867]
numerical simulation and optimization of technical systems described by partial differential equations is expensive.
A comparatively new approach in this context is to combine the good approximation properties of neural networks with the classical finite element method.
In this paper, we extend this approach to saddle-point and non-linear fluid dynamics problems, respectively.
arXiv Detail & Related papers (2024-09-06T07:17:01Z) - Controlled Learning of Pointwise Nonlinearities in Neural-Network-Like Architectures [14.93489065234423]
We present a general variational framework for the training of freeform nonlinearities in layered computational architectures.
The slope constraints allow us to impose properties such as 1-Lipschitz stability, firm non-expansiveness, and monotonicity/invertibility.
We show how to solve the numerically function-optimization problem by representing the nonlinearities in a suitable (nonuniform) B-spline basis.
arXiv Detail & Related papers (2024-08-23T14:39:27Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Constrained Optimization via Exact Augmented Lagrangian and Randomized
Iterative Sketching [55.28394191394675]
We develop an adaptive inexact Newton method for equality-constrained nonlinear, nonIBS optimization problems.
We demonstrate the superior performance of our method on benchmark nonlinear problems, constrained logistic regression with data from LVM, and a PDE-constrained problem.
arXiv Detail & Related papers (2023-05-28T06:33:37Z) - On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural
Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation.
We show that the general overparametrized formulation introduces a set of spurious equilibria which lay outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ implicit gradient descent (ISGD) method to train PINNs for improving the stability of training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate effect of symmetry on landscape connectivity by accounting for the weight permutations of networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z) - Loss landscapes and optimization in over-parameterized non-linear
systems and neural networks [20.44438519046223]
We show that wide neural networks satisfy the PL$*$ condition, which explains the (S)GD convergence to a global minimum.
We show that wide neural networks satisfy the PL$*$ condition, which explains the (S)GD convergence to a global minimum.
arXiv Detail & Related papers (2020-02-29T17:18:28Z) - Convex Geometry and Duality of Over-parameterized Neural Networks [70.15611146583068]
We develop a convex analytic approach to analyze finite width two-layer ReLU networks.
We show that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set.
In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints.
arXiv Detail & Related papers (2020-02-25T23:05:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.