Memorizing Gaussians with no over-parameterization via gradient descent on neural networks
- URL: http://arxiv.org/abs/2003.12895v1
- Date: Sat, 28 Mar 2020 21:45:42 GMT
- Title: Memorizing Gaussians with no over-parameterization via gradient descent on neural networks
- Authors: Amit Daniely
- Abstract summary: We prove that a single step of gradient descent over a depth-two network, with $q$ hidden neurons, can memorize $\Omega\left(\frac{dq}{\log^4(d)}\right)$ independent and randomly labeled Gaussians in $\mathbb{R}^d$.
The result is valid for a large class of activation functions, which includes the absolute value.
- Score: 27.374589803147025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove that a single step of gradient descent over a depth-two network,
with $q$ hidden neurons, starting from orthogonal initialization, can memorize
$\Omega\left(\frac{dq}{\log^4(d)}\right)$ independent and randomly labeled
Gaussians in $\mathbb{R}^d$. The result is valid for a large class of
activation functions, which includes the absolute value.
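As a rough, self-contained illustration of this setting (a sketch under stated assumptions, not the paper's construction or proof), the snippet below builds a depth-two network $f(x) = \sum_j v_j\,|w_j^\top x|$ with $q$ hidden neurons and an orthogonally initialized first layer, draws independent Gaussian inputs with random $\pm 1$ labels, and takes a single full-batch gradient step on the squared loss. The sizes, the learning rate, the frozen second layer, and the sign-agreement check are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 256, 64, 2048          # input dim, hidden neurons, sample count (illustrative)

# Data: independent standard Gaussians with random +/-1 labels.
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Depth-two network f(x) = sum_j v_j * |w_j . x| with absolute-value activation.
W = np.linalg.qr(rng.standard_normal((d, q)))[0]   # orthonormal columns: orthogonal init
v = rng.choice([-1.0, 1.0], size=q) / np.sqrt(q)   # second layer kept fixed (assumption)

def forward(X, W):
    Z = X @ W                   # pre-activations, shape (n, q)
    return np.abs(Z) @ v, Z     # network outputs and pre-activations

# One full-batch gradient step on the squared loss, updating the first layer only.
out, Z = forward(X, W)
residual = out - y                                   # derivative of 0.5*(out - y)^2
grad_W = X.T @ (residual[:, None] * np.sign(Z) * v) / n
lr = 1.0                                             # illustrative step size
W_next = W - lr * grad_W

out_next, _ = forward(X, W_next)
print("sign agreement after one step:", np.mean(np.sign(out_next) == y))
```

Varying $n$ against $dq$ in this toy script gives a qualitative feel for the memorization regime the theorem quantifies; the precise $\Omega\left(\frac{dq}{\log^4(d)}\right)$ rate and the class of admissible activations are established only by the paper's analysis.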
Related papers
- Emergence and scaling laws in SGD learning of shallow neural networks [46.632052892298375]
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data.
We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective.
arXiv Detail & Related papers (2025-04-28T16:58:55Z)
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d$
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
- Memory capacity of two layer neural networks with smooth activations [27.33243506775655]
We determine the memory capacity of two-layer neural networks with $m$ hidden neurons and input dimension $d$.
We derive the precise generic rank of the network's Jacobian, which can be written in terms of Hadamard powers; a toy numerical check of this Jacobian's rank is sketched after this list.
Our approach differs from prior works on memory capacity and holds promise for extending to deeper models.
arXiv Detail & Related papers (2023-08-03T19:31:15Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z)
- Deformed semicircle law and concentration of nonlinear random matrices for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
arXiv Detail & Related papers (2021-09-20T05:25:52Z)
- Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations [27.244958998196623]
We present time and sample efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations.
In particular, we consider learning an unknown network of the form $f(x) = a^{\mathsf{T}}\sigma(W^{\mathsf{T}}x+b)$, where $x$ is drawn from the Gaussian distribution, and $\sigma(t) := \max(t,0)$ is the ReLU activation.
arXiv Detail & Related papers (2021-07-21T17:06:03Z)
- Deep neural network approximation of analytic functions [91.3755431537592]
We provide an entropy bound for the spaces of neural networks with piecewise linear activation functions.
We derive an oracle inequality for the expected error of the considered penalized deep neural network estimators.
arXiv Detail & Related papers (2021-04-05T18:02:04Z)
- Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn with gradient descent to achieve loss close to that of the ground truth, going beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z)
- Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
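As a loose numerical companion to the memory-capacity entry above (an illustrative toy check, not taken from that paper), the sketch below assembles the Jacobian of a two-layer network's outputs with respect to all of its weights at $n$ random Gaussian inputs and reports its numerical rank; the tanh activation and all sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 20, 10, 150            # input dim, hidden neurons, sample count (illustrative)

X = rng.standard_normal((n, d))
W = rng.standard_normal((d, m))  # first-layer weights
v = rng.standard_normal(m)       # second-layer weights

# f(x) = sum_j v_j * tanh(w_j . x); Jacobian of the n outputs w.r.t. all weights.
Z = X @ W                        # pre-activations, shape (n, m)
S = 1.0 - np.tanh(Z) ** 2        # tanh'(z), shape (n, m)

# d f(x_i) / d W[:, j] = v_j * tanh'(z_ij) * x_i   -> an n x (m*d) block
J_W = ((S * v)[:, :, None] * X[:, None, :]).reshape(n, m * d)
# d f(x_i) / d v_j = tanh(z_ij)                    -> an n x m block
J_v = np.tanh(Z)

J = np.hstack([J_W, J_v])        # full Jacobian, shape (n, m*d + m)
print("Jacobian rank:", np.linalg.matrix_rank(J),
      "upper bound:", min(n, m * d + m))
```

The printed rank can be compared against $n$ and the parameter count $md + m$, the two quantities that trivially bound it; pinning down the generic value of this rank is what the paper's Hadamard-power formula accomplishes.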
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.