Symmetry & critical points for a model shallow neural network
- URL: http://arxiv.org/abs/2003.10576v5
- Date: Thu, 11 Mar 2021 11:08:17 GMT
- Title: Symmetry & critical points for a model shallow neural network
- Authors: Yossi Arjevani and Michael Field
- Abstract summary: We consider the optimization problem associated with fitting two-layer ReLU networks with $k$ hidden neurons.
We leverage the rich symmetry exhibited by such models to identify various families of critical points.
We show that while the loss function at certain types of spurious minima decays to zero like $k^{-1}$, in other cases the loss converges to a strictly positive constant.
- Score: 9.695960412426672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the optimization problem associated with fitting two-layer ReLU
networks with $k$ hidden neurons, where labels are assumed to be generated by a
(teacher) neural network. We leverage the rich symmetry exhibited by such
models to identify various families of critical points and express them as
power series in $k^{-\frac{1}{2}}$. These expressions are then used to derive
estimates for several related quantities which imply that not all spurious
minima are alike. In particular, we show that while the loss function at
certain types of spurious minima decays to zero like $k^{-1}$, in other cases
the loss converges to a strictly positive constant. The methods used depend on
symmetry, the geometry of group actions, bifurcation, and Artin's implicit
function theorem.
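To make the setup concrete, the following is a minimal numerical sketch of the objective being studied: the squared population loss of a $k$-neuron ReLU student measured against a $k$-neuron ReLU teacher over standard Gaussian inputs. It assumes the teacher weights are the standard basis vectors (so $d = k$) and evaluates the Gaussian expectations in closed form via the standard degree-1 arc-cosine kernel identity; the function names are illustrative, not from the paper.

```python
import numpy as np

def relu_kernel(u, v):
    """E_{x ~ N(0, I)}[relu(u.x) relu(v.x)] via the degree-1 arc-cosine
    kernel identity for standard Gaussian inputs (Cho & Saul, 2009)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    cos = np.clip(u @ v / (nu * nv), -1.0, 1.0)
    theta = np.arccos(cos)
    return nu * nv * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)

def population_loss(W, V):
    """(1/2) E[(sum_i relu(w_i.x) - sum_j relu(v_j.x))^2], expanded into
    pairwise second moments; rows of W are student weights, rows of V teacher."""
    def gram_sum(A, B):
        return sum(relu_kernel(a, b) for a in A for b in B)
    return 0.5 * (gram_sum(W, W) - 2.0 * gram_sum(W, V) + gram_sum(V, V))

k = 6
V = np.eye(k)                      # teacher weights: standard basis (assumed)
W = V[np.random.permutation(k)]    # any row permutation of the teacher
print(population_loss(V, V))       # 0.0: the teacher itself is a global minimum
print(population_loss(W, V))       # ~0.0: permuted copies are global minima too
```

The last line illustrates the symmetry the paper leverages: the loss is invariant under permuting the student's neurons, which is part of what organizes critical points into the families analyzed in the abstract.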
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks may deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Hidden Minima in Two-Layer ReLU Networks [7.23389716633927]
Two types of infinite families of spurious minima, each containing one minimum per input dimension $d$, were recently found.
The loss at minima belonging to the first type converges to zero as $d$ increases.
In the second type, the loss remains bounded away from zero.
arXiv Detail & Related papers (2023-12-28T04:27:15Z)
- Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z)
- Annihilation of Spurious Minima in Two-Layer ReLU Networks [9.695960412426672]
We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss.
We show that adding neurons can turn symmetric spurious minima into saddles.
We also prove the existence of descent directions in certain subspaces arising from the symmetry structure of the loss function.
arXiv Detail & Related papers (2022-10-12T11:04:21Z)
- The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can separate two classes of data with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z)
- Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks [15.711517003382484]
We show that the Hessian spectrum concentrates near positive values, except for $\Theta(d)$ eigenvalues which grow linearly with $d$.
This makes it possible to create and annihilate minima using powerful tools from bifurcation theory.
arXiv Detail & Related papers (2021-07-21T22:05:48Z)
- Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances [9.390008801320024]
We show that adding one extra neuron to each layer is sufficient to connect all previously discrete minima into a single manifold.
We show that the number of symmetry-induced critical subspaces dominates the number of affine subspaces forming the global minima manifold.
arXiv Detail & Related papers (2021-05-25T21:19:07Z)
- Deep neural network approximation of analytic functions [91.3755431537592]
We provide an entropy bound for the spaces of neural networks with piecewise linear activation functions.
We derive an oracle inequality for the expected error of the considered penalized deep neural network estimators.
arXiv Detail & Related papers (2021-04-05T18:02:04Z)
- Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry [9.695960412426672]
We analytically characterize the Hessian at various families of spurious minima.
In particular, we prove that for $d \ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, and (b) $\Omega(d)$ of the eigenvalues grow linearly with $k$.
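As a rough empirical companion to this claim, the sketch below (reusing relu_kernel and population_loss from the earlier sketch) runs plain gradient descent on the same population loss from a random initialization and then inspects the spectrum of a finite-difference Hessian at the resulting point. It is only a numerical probe under assumed hyperparameters, not the paper's analytic computation, and the point reached is not guaranteed to be one of the spurious minima characterized there.

```python
import numpy as np
# assumes relu_kernel and population_loss from the earlier sketch are in scope

k = 4
f = lambda w: population_loss(w.reshape(k, k), np.eye(k))

def num_grad(w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)  # central differences
    return g

def num_hessian(w, eps=1e-4):
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        H[:, i] = (num_grad(w + e) - num_grad(w - e)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize away finite-difference noise

rng = np.random.default_rng(0)
w = rng.normal(scale=1.0 / np.sqrt(k), size=k * k)
for _ in range(500):          # plain gradient descent, ad hoc step size
    w -= 0.2 * num_grad(w)
eigs = np.linalg.eigvalsh(num_hessian(w))
print(np.round(eigs, 3))      # look for a cluster near 0 plus a few large values
```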
arXiv Detail & Related papers (2020-08-04T20:08:35Z)
- Semiparametric Nonlinear Bipartite Graph Representation Learning with Provable Guarantees [106.91654068632882]
We consider bipartite graphs and formalize their representation learning problem as statistical estimation of the parameters of a semiparametric exponential family distribution.
We show that the proposed objective is strongly convex in a neighborhood around the ground truth, so that a gradient descent-based method achieves a linear convergence rate.
Our estimator is robust to any model misspecification within the exponential family, which is validated in extensive experiments.
arXiv Detail & Related papers (2020-03-02T16:40:36Z)
- Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.