The Effects of Mild Over-parameterization on the Optimization Landscape
of Shallow ReLU Neural Networks
- URL: http://arxiv.org/abs/2006.01005v2
- Date: Fri, 30 Jul 2021 15:47:15 GMT
- Title: The Effects of Mild Over-parameterization on the Optimization Landscape
of Shallow ReLU Neural Networks
- Authors: Itay Safran, Gilad Yehudai, Ohad Shamir
- Abstract summary: We prove that the objective is strongly convex around the global minima when the teacher and student networks possess the same number of neurons.
For the non-global minima, we prove that adding even just a single neuron will turn a non-global minimum into a saddle point.
- Score: 36.35321290763711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the effects of mild over-parameterization on the optimization
landscape of a simple ReLU neural network of the form
$\mathbf{x}\mapsto\sum_{i=1}^k\max\{0,\mathbf{w}_i^{\top}\mathbf{x}\}$, in a
well-studied teacher-student setting where the target values are generated by
the same architecture, and when directly optimizing over the population squared
loss with respect to Gaussian inputs. We prove that while the objective is
strongly convex around the global minima when the teacher and student networks
possess the same number of neurons, it is not even \emph{locally convex} after
any amount of over-parameterization. Moreover, related desirable properties
(e.g., one-point strong convexity and the Polyak-{\L}ojasiewicz condition) also
do not hold even locally. On the other hand, we establish that the objective
remains one-point strongly convex in \emph{most} directions (suitably defined),
and show an optimization guarantee under this property. For the non-global
minima, we prove that adding even just a single neuron will turn a non-global
minimum into a saddle point. This holds under some technical conditions which
we validate empirically. These results provide a possible explanation for why
recovering a global minimum becomes significantly easier when we
over-parameterize, even if the amount of over-parameterization is very
moderate.
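To make the setting concrete, here is a minimal Python sketch (not taken from the paper) that estimates this population loss by Monte Carlo sampling of Gaussian inputs; the teacher/student sizes, the sample-average approximation of the Gaussian expectation, and all variable names are illustrative assumptions.

```python
import numpy as np

def relu_net(W, X):
    # x -> sum_i max(0, w_i^T x), evaluated for every row x of X
    return np.maximum(X @ W.T, 0.0).sum(axis=1)

def population_loss_mc(W_student, W_teacher, num_samples=200_000, seed=0):
    # Monte Carlo estimate of E_{x ~ N(0, I_d)} [(student(x) - teacher(x))^2]
    rng = np.random.default_rng(seed)
    d = W_teacher.shape[1]
    X = rng.standard_normal((num_samples, d))
    residual = relu_net(W_student, X) - relu_net(W_teacher, X)
    return np.mean(residual ** 2)

# Teacher with n = 4 neurons in d = 6 dimensions; student mildly
# over-parameterized with k = n + 1 = 5 neurons.
rng = np.random.default_rng(1)
d, n, k = 6, 4, 5
W_teacher = rng.standard_normal((n, d))
W_student = rng.standard_normal((k, d))
print(population_loss_mc(W_student, W_teacher))
```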
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
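Here "stable" is commonly formalized via linear stability of gradient descent (this formalization is our gloss; the paper's exact definition may be a refinement): a local minimum $\theta^*$ can be stable under GD with step size $\eta$ only if
$$\lambda_{\max}\!\left(\nabla^2 L(\theta^*)\right) \;\le\; \frac{2}{\eta},$$
so larger step sizes can only retain flatter minima, which is what links the fixed learning rate to the smoothness of the learned function.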
arXiv Detail & Related papers (2024-06-10T22:57:27Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of over-parameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
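The building block here, gradient descent-ascent (GDA), is easy to state in finite dimensions; the toy convex-concave objective below is purely illustrative and does not capture the paper's mean-field, infinite-dimensional setting.

```python
# Toy convex-concave objective: f(x, y) = 0.5*x**2 + x*y - 0.5*y**2,
# whose unique saddle point is (0, 0).
def grad_x(x, y):
    return x + y     # df/dx

def grad_y(x, y):
    return x - y     # df/dy

x, y, eta = 1.0, 1.0, 0.1
for _ in range(1000):
    gx, gy = grad_x(x, y), grad_y(x, y)   # simultaneous gradients
    x -= eta * gx                          # descent step for the min player
    y += eta * gy                          # ascent step for the max player
print(x, y)                                # both approach 0
```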
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron [49.45105570960104]
We prove the global convergence of randomly initialized gradient descent with an $O\left(T^{-3}\right)$ rate.
Together with a matching lower bound, this gives an exact characterization of the convergence rate.
We show that the potential function used in the analysis converges slowly, which implies the slow convergence rate of the loss function.
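A hedged sketch of the setting described above: a single ReLU teacher neuron and an over-parameterized student trained by gradient descent on Gaussian data. The sample size, step size, and initialization scale are arbitrary illustrative choices, and observing the slow rate empirically would require tracking the loss over far more iterations than shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_samples, eta, steps = 10, 3, 5000, 0.05, 2000

# Teacher: a single ReLU neuron x -> max(0, v^T x) with unit-norm v.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)

X = rng.standard_normal((n_samples, d))          # Gaussian inputs
y = np.maximum(X @ v, 0.0)                        # teacher labels

# Over-parameterized student: k ReLU neurons, small random init.
W = 0.1 * rng.standard_normal((k, d))

for t in range(steps):
    pre = X @ W.T                                 # (n_samples, k) pre-activations
    pred = np.maximum(pre, 0.0).sum(axis=1)       # student output
    resid = pred - y
    # Gradient of the (1/2) mean squared loss w.r.t. each w_i:
    grad = ((resid[:, None] * (pre > 0)).T @ X) / n_samples   # (k, d)
    W -= eta * grad
    if t % 500 == 0:
        print(t, np.mean(resid ** 2))
```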
arXiv Detail & Related papers (2023-02-20T15:33:26Z)
- When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work [59.29606307518154]
We show that as long as the width satisfies $m \geq 2n/d$ (where $d$ is the input dimension), the network's expressivity is strong, i.e., there exists at least one global minimizer with zero training loss.
We also consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer.
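As an illustrative instantiation of the width bound above: with input dimension $d=100$ and $n=10{,}000$ training points,
$$m \;\geq\; \frac{2n}{d} \;=\; \frac{2\cdot 10{,}000}{100} \;=\; 200 \;\ll\; n,$$
so a width of $200$ neurons already admits a zero-training-loss global minimizer, far fewer than $n$ neurons.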
arXiv Detail & Related papers (2022-10-21T14:41:26Z)
- On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features [38.05002597295796]
An intriguing empirical phenomenon, Neural Collapse, has been widely observed in the last-layer features and classifiers of deep neural networks trained for classification tasks: they collapse to the vertices of a Simplex Equiangular Tight Frame (ETF).
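For reference, a $K$-point simplex ETF can be written down explicitly; the numpy sketch below uses the standard construction (equal-norm vectors with pairwise cosine $-1/(K-1)$) and is not code from the paper.

```python
import numpy as np

def simplex_etf(K, d):
    # d x K matrix whose columns form a K-point simplex ETF: equal-norm
    # vectors with pairwise cosine similarity exactly -1/(K-1).
    assert d >= K
    M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    # Embed the K centered directions into d dimensions via a matrix with
    # orthonormal columns (any such embedding preserves the Gram matrix).
    P = np.linalg.qr(np.random.default_rng(0).standard_normal((d, K)))[0]
    return P @ M

E = simplex_etf(K=5, d=8)
G = E.T @ E                    # Gram matrix
print(np.round(G, 3))          # diagonal 1.0, off-diagonal -1/(K-1) = -0.25
```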
arXiv Detail & Related papers (2022-03-02T17:00:18Z)
- The loss landscape of deep linear neural networks: a second-order analysis [9.85879905918703]
We study the optimization landscape of deep linear neural networks with the square loss.
We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points.
arXiv Detail & Related papers (2021-07-28T11:33:18Z)
- Which Minimizer Does My Neural Network Converge To? [5.575448433529451]
We explain how common variants of the standard NN training procedure change the minimizer obtained.
We show that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer.
This adaptive minimizer is changed further by minibatch training, even though in the non-adaptive case, GD and SGD result in essentially the same minimizer.
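A minimal sketch of the kind of effect described in this entry, on over-parameterized linear least squares where both plain GD and a hand-rolled AdaGrad drive the training loss to (near) zero yet end up at different interpolating solutions; the problem instance, step sizes, and iteration counts are illustrative assumptions rather than the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # over-parameterized: many zero-loss minimizers
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad(theta):
    # gradient of the loss (1/(2n)) * ||A @ theta - b||^2
    return A.T @ (A @ theta - b) / n

# Plain gradient descent from zero initialization.
theta_gd = np.zeros(d)
for _ in range(20_000):
    theta_gd -= 0.1 * grad(theta_gd)

# Hand-rolled AdaGrad (diagonal accumulator) from the same initialization.
theta_ada, acc = np.zeros(d), np.zeros(d)
for _ in range(20_000):
    g = grad(theta_ada)
    acc += g ** 2
    theta_ada -= 0.1 * g / (np.sqrt(acc) + 1e-10)

# Both residuals should be (near) zero ...
print(np.linalg.norm(A @ theta_gd - b), np.linalg.norm(A @ theta_ada - b))
# ... while the two minimizers themselves differ.
print(np.linalg.norm(theta_gd - theta_ada))
```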
arXiv Detail & Related papers (2020-11-04T17:04:01Z)
- The Hidden Convex Optimization Landscape of Two-Layer ReLU Neural Networks: an Exact Characterization of the Optimal Solutions [51.60996023961886]
We prove that finding all globally optimal two-layer ReLU neural networks can be performed by solving a convex optimization program with cone constraints.
Our analysis is novel, characterizes all optimal solutions, and does not leverage duality-based analysis which was recently used to lift neural network training into convex spaces.
arXiv Detail & Related papers (2020-06-10T15:38:30Z)
- Convex Geometry and Duality of Over-parameterized Neural Networks [70.15611146583068]
We develop a convex analytic approach to analyze finite width two-layer ReLU networks.
We show that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set.
In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints.
arXiv Detail & Related papers (2020-02-25T23:05:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.