Geometry of the Loss Landscape in Overparameterized Neural Networks:
Symmetries and Invariances
- URL: http://arxiv.org/abs/2105.12221v1
- Date: Tue, 25 May 2021 21:19:07 GMT
- Title: Geometry of the Loss Landscape in Overparameterized Neural Networks:
Symmetries and Invariances
- Authors: Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro,
Clément Hongler, Wulfram Gerstner, Johanni Brea
- Abstract summary: We show that adding one extra neuron to each layer is sufficient to connect all previously discrete minima into a single manifold.
We show that the number of symmetry-induced critical subspaces dominates the number of affine subspaces forming the global minima manifold in the mildly overparameterized regime, and vice versa in the vastly overparameterized regime.
- Score: 9.390008801320024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study how permutation symmetries in overparameterized multi-layer neural
networks generate `symmetry-induced' critical points. Assuming a network with $
L $ layers of minimal widths $ r_1^*, \ldots, r_{L-1}^* $ reaches a zero-loss
minimum at $ r_1^*! \cdots r_{L-1}^*! $ isolated points that are permutations
of one another, we show that adding one extra neuron to each layer is
sufficient to connect all these previously discrete minima into a single
manifold. For a two-layer overparameterized network of width $ r^*+ h =: m $ we
explicitly describe the manifold of global minima: it consists of $ T(r^*, m) $
affine subspaces of dimension at least $ h $ that are connected to one another.
For a network of width $m$, we identify the number $G(r,m)$ of affine subspaces
containing only symmetry-induced critical points that are related to the
critical points of a smaller network of width $r<r^*$. Via a combinatorial
analysis, we derive closed-form formulas for $ T $ and $ G $ and show that the
number of symmetry-induced critical subspaces dominates the number of affine
subspaces forming the global minima manifold in the mildly overparameterized
regime (small $ h $) and vice versa in the vastly overparameterized regime ($h
\gg r^*$). Our results provide new insights into the minimization of the
non-convex loss function of overparameterized neural networks.
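The combinatorial counting in the abstract rests on a basic fact: permuting the hidden neurons of a layer, together with their incoming and outgoing weights, leaves the network function and hence the loss unchanged, so a zero-loss minimum reappears at each of the $ r_1^*! \cdots r_{L-1}^*! $ permuted parameter vectors. The NumPy sketch below is a minimal illustration of this invariance for a two-layer ReLU network; it is not code from the paper, and the widths, activation, and random weights are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 3, 4, 5                       # input dim, hidden width, number of test inputs

W1 = rng.normal(size=(r, d))            # hidden-layer weights
b1 = rng.normal(size=r)                 # hidden-layer biases
w2 = rng.normal(size=r)                 # output weights

def net(x, W1, b1, w2):
    # two-layer ReLU network: f(x) = w2 . relu(W1 x + b1)
    return np.maximum(W1 @ x + b1, 0.0) @ w2

perm = rng.permutation(r)               # one of the r! relabelings of the hidden neurons
X = rng.normal(size=(n, d))

for x in X:
    # permuting rows of W1, entries of b1, and entries of w2 consistently
    # relabels the hidden neurons without changing the computed function
    assert np.isclose(net(x, W1, b1, w2),
                      net(x, W1[perm], b1[perm], w2[perm]))
print("network function is invariant under hidden-neuron permutation")
```

Because every such permutation yields the same outputs, gradient-based training can converge to any one of these equivalent copies of the same minimum.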
Related papers
- Deep Neural Networks: Multi-Classification and Universal Approximation [0.0]
We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements.
We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m \geq 1$.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with an arbitrary link function at a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d$ ...
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Nearly Optimal Regret for Decentralized Online Convex Optimization [53.433398074919]
Decentralized online convex optimization (D-OCO) aims to minimize a sequence of global loss functions using only local computations and communications.
We develop novel D-OCO algorithms that can respectively reduce the regret bounds for convex and strongly convex functions.
Our algorithms are nearly optimal in terms of $T$, $n$, and $\rho$.
arXiv Detail & Related papers (2024-02-14T13:44:16Z) - Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization [1.189367612437469]
We consider shallow neural networks with one hidden layer, a ReLU activation function, and an ${\mathcal L}^2$ Schatten-class (or Hilbert-Schmidt) cost function.
We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$, where $\delta_P$ measures the signal-to-noise ratio of the training inputs.
In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q \leq M$ by a ...
arXiv Detail & Related papers (2023-09-19T07:12:41Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion: the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove training-loss minimization with $m=\Omega(\log^4(n))$ neurons in $T \approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T \approx n$, we bound the test loss by $\tilde{O}(1/\ldots)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Neural Networks Efficiently Learn Low-Dimensional Representations with
SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction.
We also provide compression guarantees for NNs using the approximate low-rank structure produced by SGD.
arXiv Detail & Related papers (2022-09-29T15:29:10Z) - Shallow neural network representation of polynomials [91.3755431537592]
We show that $d$-variate polynomials of degree $R$ can be represented on $[0,1]^d$ as shallow neural networks of width $d+1+\sum_{r=2}^{R}\binom{r+d-1}{d-1}$ (a short sketch evaluating this width appears after this list).
arXiv Detail & Related papers (2022-08-17T08:14:52Z) - On minimal representations of shallow ReLU networks [0.0]
We show that the minimal representation for $f$ uses either $n$, $n+1$ or $n+2$ neurons.
In particular, when the input is one-dimensional, minimal representations always use at most $n+1$ neurons, but in all higher-dimensional settings there are functions for which $n+2$ neurons are needed.
arXiv Detail & Related papers (2021-08-12T10:22:24Z) - Small Covers for Near-Zero Sets of Polynomials and Learning Latent
Variable Models [56.98280399449707]
We show that there exists an $\epsilon$-cover for $S$ of cardinality $M = (k/\epsilon)^{O_d(k^{1/d})}$.
Building on our structural result, we obtain significantly improved learning algorithms for several fundamental high-dimensional probabilistic models with hidden variables.
arXiv Detail & Related papers (2020-12-14T18:14:08Z) - A deep network construction that adapts to intrinsic dimensionality
beyond the domain [79.23797234241471]
We study the approximation of two-layer compositions $f(x) = g(\phi(x))$ via deep networks with ReLU activation.
We focus on two intuitive and practically relevant choices for $\phi$: the projection onto a low-dimensional embedded submanifold and a distance to a collection of low-dimensional sets.
arXiv Detail & Related papers (2020-08-06T09:50:29Z)
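As a quick illustration of the width formula quoted in the "Shallow neural network representation of polynomials" entry above, the sketch below evaluates $d+1+\sum_{r=2}^{R}\binom{r+d-1}{d-1}$ for a few small choices of $d$ and $R$. The formula is reconstructed from a garbled summary and should be treated as an assumption; the example values are arbitrary.

```python
# Width of a shallow network representing d-variate polynomials of degree R,
# as quoted above: width(d, R) = d + 1 + sum_{r=2}^{R} C(r+d-1, d-1).
# This total equals C(R+d, d), the number of d-variate monomials of degree <= R.
from math import comb

def shallow_poly_width(d: int, R: int) -> int:
    return d + 1 + sum(comb(r + d - 1, d - 1) for r in range(2, R + 1))

for d, R in [(1, 5), (2, 3), (3, 4)]:
    w = shallow_poly_width(d, R)
    assert w == comb(R + d, d)        # sanity check: equals the monomial count
    print(f"d={d}, R={R}: width {w}")
```

The sanity check reflects that one neuron per monomial (beyond the affine part) is the natural bookkeeping behind a width of this size.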
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.