Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale
of Symmetry
- URL: http://arxiv.org/abs/2008.01805v2
- Date: Thu, 15 Oct 2020 22:53:35 GMT
- Title: Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale
of Symmetry
- Authors: Yossi Arjevani, Michael Field
- Abstract summary: We analytically characterize the Hessian at various families of spurious minima.
In particular, we prove that for $d\ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, (b) $\Omega(d)$ of the eigenvalues grow linearly with $k$.
- Score: 9.695960412426672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the optimization problem associated with fitting two-layers ReLU
networks with respect to the squared loss, where labels are generated by a
target network. We leverage the rich symmetry structure to analytically
characterize the Hessian at various families of spurious minima in the natural
regime where the number of inputs $d$ and the number of hidden neurons $k$ are
finite. In particular, we prove that for $d\ge k$ standard Gaussian inputs: (a)
of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, (b)
$\Omega(d)$ of the eigenvalues grow linearly with $k$. Although this phenomenon
of an extremely skewed spectrum has been observed many times before, to our
knowledge, this is the first time it has been established rigorously. Our
analytic approach uses techniques, new to the field, from symmetry breaking and
representation theory, and carries important implications for our ability to
argue about statistical generalization through local curvature.
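The skewed-spectrum claim is easy to probe numerically. The sketch below (an illustration, not the paper's analytic computation) fits a two-layer ReLU student to a ReLU teacher on standard Gaussian inputs, then eigendecomposes the $dk \times dk$ Hessian of the empirical squared loss at the point gradient descent reaches; the dimensions, the teacher, and the optimizer settings are all assumptions, and the point found need not lie on the symmetric families of spurious minima studied in the paper.

```python
import torch

torch.manual_seed(0)
d, k, n = 20, 5, 20000                   # inputs, hidden neurons, MC samples (assumed)

X = torch.randn(n, d)                    # standard Gaussian inputs
V = torch.randn(k, d)                    # teacher weights (illustrative)
y = torch.relu(X @ V.T).sum(dim=1)       # labels generated by the target network

def loss(w_flat):
    W = w_flat.view(k, d)
    pred = torch.relu(X @ W.T).sum(dim=1)
    return 0.5 * ((pred - y) ** 2).mean()

# reach an approximate minimum by plain gradient-based training
w = torch.randn(k * d, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss(w).backward()
    opt.step()

# dense dk x dk Hessian of the empirical loss at the point found
H = torch.autograd.functional.hessian(loss, w.detach())
eig = torch.linalg.eigvalsh(H)
print(f"{(eig.abs() < 1e-2).sum().item()} of {k * d} eigenvalues near zero")
print("top eigenvalues:", eig[-3:])
```

Qualitatively, one should see most of the $dk$ eigenvalues clustered near zero and a small group of large ones, echoing the $dk - O(d)$ versus $\Omega(d)$ split.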
Related papers
- Exact Community Recovery (under Side Information): Optimality of Spectral Algorithms [1.4732811715354452]
We study the problem of exact community recovery in general, two-community block models.
We provide a unified analysis of the effect of side information on the information-theoretic limits of exact recovery.
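As a concrete instance of the spectral algorithms in the title, here is a minimal sketch of vanilla spectral recovery on a symmetric two-community stochastic block model, without the side information the paper analyzes; the graph size and edge probabilities are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 400, 0.10, 0.02                   # nodes, within/between edge probs (assumed)

labels = np.repeat([1, -1], n // 2)         # balanced ground-truth communities
P = np.where(np.equal.outer(labels, labels), p, q)
A = np.triu(rng.random((n, n)) < P, 1)
A = (A + A.T).astype(float)                 # symmetric adjacency, no self-loops

# spectral recovery: sign of the eigenvector of the second-largest eigenvalue
vals, vecs = np.linalg.eigh(A)
guess = np.sign(vecs[:, -2])
acc = max(np.mean(guess == labels), np.mean(guess == -labels))
print(f"fraction of nodes recovered: {acc:.3f}")
```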
arXiv Detail & Related papers (2024-06-18T21:48:59Z)
- Information-Theoretic Thresholds for Planted Dense Cycles [52.076657911275525]
We study a random graph model for small-world networks which are ubiquitous in social and biological sciences.
For both detection and recovery of the planted dense cycle, we characterize the information-theoretic thresholds in terms of $n$, $\tau$, and an edge-wise signal-to-noise ratio $\lambda$.
arXiv Detail & Related papers (2024-02-01T03:39:01Z)
- Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- effective Minkowski dimension.
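The effective Minkowski dimension refines the classical box-counting dimension by making it scale-dependent. The sketch below (a crude numerical analogue, not the paper's definition) estimates the box-counting dimension at a few scales for data concentrated near a 2-dimensional subset of $\mathbb{R}^{10}$; all constants are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 10, 50000
pts = np.zeros((n, D))
pts[:, :2] = rng.standard_normal((n, 2))       # intrinsically 2-dim structure
pts += 0.01 * rng.standard_normal((n, D))      # small off-manifold noise

def box_count(pts, eps):
    """Number of axis-aligned eps-boxes containing at least one point."""
    return len(np.unique(np.floor(pts / eps), axis=0))

# dimension estimate: log2 of the growth in box count when eps is halved
for eps in (1.0, 0.5, 0.25):
    est = np.log2(box_count(pts, eps / 2) / box_count(pts, eps))
    print(f"eps = {eps:>4}: dimension estimate {est:.2f}")
```

At these moderate scales the estimate hovers near the intrinsic dimension 2 rather than the ambient 10; at scales below the noise level it would drift toward 10, which is the scale dependence the "effective" notion captures.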
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
- Generalization and Stability of Interpolating Neural Networks with Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$ gradient-descent iterations, we prove that the training loss converges to zero and bound the test loss by $\tilde{O}(1/)$.
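As a quick empirical companion to the interpolating regime (a sketch with assumed sizes, not the paper's analysis), a modest-width two-layer ReLU network can be driven to near-zero loss on a generic training set:

```python
import torch

torch.manual_seed(0)
n, d, m = 128, 10, 256                      # samples, input dim, width (assumed)
X, y = torch.randn(n, d), torch.randn(n)    # generic data to interpolate

model = torch.nn.Sequential(
    torch.nn.Linear(d, m), torch.nn.ReLU(), torch.nn.Linear(m, 1)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(3000):
    opt.zero_grad()
    loss = ((model(X).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.2e}")  # near zero: interpolation
```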
arXiv Detail & Related papers (2023-02-18T05:06:15Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
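To make the setting concrete, here is a minimal sketch of one standard regularized kernel LSTD estimator on a toy Markov reward process; the chain, the kernel bandwidth, and the ridge parameter are all assumptions, and this is not necessarily the exact estimator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 500

# trajectory from a toy MRP: reflected random walk on [0, 1]
s = np.empty(n + 1)
s[0] = rng.random()
for t in range(n):
    s[t + 1] = np.clip(s[t] + 0.1 * rng.standard_normal(), 0.0, 1.0)
r = s[:-1] + 0.1 * rng.standard_normal(n)   # reward ~ current state + noise

def K(a, b, bw=0.2):
    """Gaussian kernel Gram matrix between 1-d state arrays."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw**2))

# TD fixed point for V(x) = sum_j alpha_j k(s_j, x):
# (K(s_t, s_j) - gamma * K(s_{t+1}, s_j)) alpha = r_t, ridge-regularized
lam = 1.0
A = K(s[:-1], s[:-1]) - gamma * K(s[1:], s[:-1]) + lam * np.eye(n)
alpha = np.linalg.solve(A, r)

V = lambda x: K(np.asarray(x, dtype=float), s[:-1]) @ alpha
print("estimated values at states 0.1, 0.5, 0.9:", V([0.1, 0.5, 0.9]).round(2))
```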
arXiv Detail & Related papers (2021-09-24T14:48:20Z)
- Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks [15.711517003382484]
We show that the Hessian spectrum concentrates near positive values, with the exception of $\Theta(d)$ eigenvalues which grow linearly with $d$.
This makes it possible to study the creation and annihilation of minima using powerful tools from bifurcation theory.
arXiv Detail & Related papers (2021-07-21T22:05:48Z)
- Exponential ReLU Neural Network Approximation Rates for Point and Edge Singularities [0.0]
We prove exponential expressivity with stable ReLU Neural Networks (ReLU NNs) in $H^1(\Omega)$ for weighted analytic function classes in certain polytopal domains.
The exponential approximation rates are shown to hold in space dimension $d = 2$ on Lipschitz polygons with straight sides, and in space dimension $d=3$ on Fichera-type polyhedral domains with plane faces.
arXiv Detail & Related papers (2020-10-23T07:44:32Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression, where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
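For intuition about the generative model, the sketch below draws data from $Y = \langle X, w^* \rangle + \epsilon$ with a Gaussian (hence sub-Gaussian) design and heavy-tailed but finite-variance noise, then compares ordinary least squares against a simple Huber regression fit by iteratively reweighted least squares; the Huber baseline is an illustration, not the paper's nearly-linear-time estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))               # Gaussian, hence sub-Gaussian, design
eps = rng.standard_t(df=2.5, size=n)          # heavy-tailed noise, finite variance
y = X @ w_star + eps

def huber_irls(X, y, delta=1.0, iters=50):
    """Huber regression via iteratively reweighted least squares."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS initialization
    for _ in range(iters):
        res = y - X @ w
        wt = np.minimum(1.0, delta / np.maximum(np.abs(res), 1e-12))
        Xw = X * wt[:, None]                  # reweighted design
        w = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return w

w_hub = huber_irls(X, y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS error  :", np.linalg.norm(w_ols - w_star))
print("Huber error:", np.linalg.norm(w_hub - w_star))
```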
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- The Generalized Lasso with Nonlinear Observations and Generative Priors [63.541900026673055]
We make the assumption of sub-Gaussian measurements, which is satisfied by a wide range of measurement models.
We show that our result can be extended to the uniform recovery guarantee under the assumption of a so-called local embedding property.
arXiv Detail & Related papers (2020-06-22T16:43:35Z)
- Symmetry & critical points for a model shallow neural network [9.695960412426672]
We consider the optimization problem associated with fitting two-layer ReLU networks with $k$ hidden neurons.
We leverage the rich symmetry exhibited by such models to identify various families of critical points.
We show that while the loss function at certain types of spurious minima decays to zero like $k^{-1}$, in other cases the loss converges to a strictly positive constant.
arXiv Detail & Related papers (2020-03-23T23:13:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.