Analysis of One-Hidden-Layer Neural Networks via the Resolvent Method
- URL: http://arxiv.org/abs/2105.05115v1
- Date: Tue, 11 May 2021 15:17:39 GMT
- Title: Analysis of One-Hidden-Layer Neural Networks via the Resolvent Method
- Authors: Vanessa Piccolo and Dominik Schr\"oder
- Abstract summary: Motivated by random neural networks, we consider the random matrix $M = Y Y^\ast$ with $Y = f(WX)$.
We prove that the Stieltjes transform of the limiting spectral distribution satisfies a quartic self-consistent equation up to some error terms.
In addition, we extend the previous results to the case of additive bias $Y=f(WX+B)$ with $B$ being an independent rank-one Gaussian random matrix.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We compute the asymptotic empirical spectral distribution of a non-linear
random matrix model by using the resolvent method. Motivated by random neural
networks, we consider the random matrix $M = Y Y^\ast$ with $Y = f(WX)$, where
$W$ and $X$ are random rectangular matrices with i.i.d. centred entries and $f$
is a non-linear smooth function which is applied entry-wise. We prove that the
Stieltjes transform of the limiting spectral distribution satisfies a quartic
self-consistent equation up to some error terms, which is exactly the equation
obtained by [Pennington, Worah] and [Benigni, P\'{e}ch\'{e}] with the moment
method approach. In addition, we extend the previous results to the case of
additive bias $Y=f(WX+B)$ with $B$ being an independent rank-one Gaussian
random matrix, more closely modelling the neural network architectures encountered
in practice. Our approach, following the \emph{resolvent method}, is more robust
than the moment method and is expected to provide insights also for models
where the combinatorics of the latter become intractable.
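As a rough numerical illustration of the model above (not taken from the paper), the following sketch samples the empirical spectral distribution of $M = Y Y^\ast$ (normalised here by the sample dimension $m$) with $Y = f(WX)$; the dimensions, the variance normalisations and the choice $f = \tanh$ are assumptions made for concreteness.

```python
import numpy as np

# Minimal sketch (not from the paper): empirical spectral distribution of
# M = Y Y^* / m with Y = f(W X), W and X having i.i.d. centred Gaussian entries
# and f applied entry-wise. Dimensions, the 1/sqrt(n1) scaling of W and the
# choice f = tanh are illustrative assumptions.
rng = np.random.default_rng(0)
n0, n1, m = 500, 750, 1000                        # proportional dimensions (assumption)

W = rng.standard_normal((n0, n1)) / np.sqrt(n1)   # i.i.d. centred, variance 1/n1
X = rng.standard_normal((n1, m))                  # i.i.d. centred entries

Y = np.tanh(W @ X)                                # entry-wise non-linearity
M = Y @ Y.T / m                                   # sample-covariance-type matrix

eigenvalues = np.linalg.eigvalsh(M)
# A histogram of `eigenvalues` approximates the limiting law whose Stieltjes
# transform satisfies the quartic self-consistent equation discussed above.
density, edges = np.histogram(eigenvalues, bins=50, density=True)
print(edges.min(), edges.max())
```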
Related papers
- An approximation of the $S$ matrix for solving the Marchenko equation [0.0]
I present a new approximation of the $S$-matrix dependence on momentum $q$, formulated as a sum of a rational function and a truncated Sinc series.
This approach enables pointwise determination of the $S$ matrix with specified resolution, capturing essential features such as resonance behavior with high accuracy.
arXiv Detail & Related papers (2024-10-27T11:06:28Z)
- An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices with Polynomial Scalings [21.727073594338297]
This study is motivated by applications in machine learning and statistics.
We establish the weak limit of the empirical eigenvalue distribution of these random matrices in a polynomial scaling regime.
This limit can be characterized as the free additive convolution of a Marchenko-Pastur law and a semicircle law.
arXiv Detail & Related papers (2022-05-12T18:50:21Z)
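As a quick numerical illustration of the free additive convolution mentioned in the previous entry (not code from that paper), one can add an independent Wishart matrix (Marchenko-Pastur limit) and a GOE matrix (semicircle limit) and inspect the spectrum of the sum; the size and aspect ratio below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: the spectrum of S = W + G, with W a Wishart matrix
# (Marchenko-Pastur limit) and G an independent GOE matrix (semicircle limit),
# approximates the free additive convolution of the two laws for large n.
# The size n and the aspect ratio n/m are assumptions made for the example.
rng = np.random.default_rng(1)
n, m = 1000, 2000

A = rng.standard_normal((n, m))
W = A @ A.T / m                          # Wishart: Marchenko-Pastur spectrum

B = rng.standard_normal((n, n))
G = (B + B.T) / np.sqrt(2 * n)           # GOE: semicircle spectrum on [-2, 2]

eigs = np.linalg.eigvalsh(W + G)         # spectrum of the free additive sum (approx.)
print(float(eigs.min()), float(eigs.max()))
```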
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper bounds can be extended to two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks [0.0]
We show that the largest eigenvalue has the same limit (in probability) as that of some well-known linear random matrix ensembles.
This may be of interest for applications to machine learning.
arXiv Detail & Related papers (2022-01-13T00:48:20Z)
- Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z)
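A minimal sketch of the ternary-random-feature idea from the previous entry, not the authors' implementation: the projection matrix has entries in $\{-1, 0, +1\}$, so applying it needs no floating-point multiplications; the sparsity level and the sign non-linearity are assumptions made here.

```python
import numpy as np

# Ternary random features sketch: the projection matrix W has entries in
# {-1, 0, +1}, so W @ x needs only additions/subtractions and W can be stored
# with ~2 bits per entry. The sparsity level and the sign non-linearity are
# assumptions made for this illustration.
rng = np.random.default_rng(2)

def ternary_random_matrix(n_features, dim, sparsity=0.5):
    """Entries are 0 with probability `sparsity`, otherwise +1 or -1 equally."""
    signs = rng.choice([-1.0, 1.0], size=(n_features, dim))
    mask = rng.random((n_features, dim)) >= sparsity
    return signs * mask

def ternary_features(X, W):
    """Random feature map sigma(X W^T) with a sign non-linearity."""
    return np.sign(X @ W.T)

X = rng.standard_normal((200, 64))       # 200 toy inputs of dimension 64
W = ternary_random_matrix(512, 64)       # 512 ternary random features
Phi = ternary_features(X, W)             # feature matrix used downstream
print(Phi.shape, np.unique(W))
```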
- Deformed semicircle law and concentration of nonlinear random matrices for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
arXiv Detail & Related papers (2021-09-20T05:25:52Z)
- Feature Cross Search via Submodular Optimization [58.15569071608769]
We study feature cross search as a fundamental primitive in feature engineering.
We show that there exists a simple greedy $(1-1/e)$-approximation algorithm for this problem.
arXiv Detail & Related papers (2021-07-05T16:58:31Z)
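A generic sketch of the greedy $(1-1/e)$-approximation referred to in the previous entry, applied to a monotone submodular objective under a cardinality constraint; the weighted-coverage objective below is only a stand-in for the feature-cross utility studied in that paper.

```python
import numpy as np

# Classic greedy algorithm for maximizing a monotone submodular set function f
# under a cardinality constraint k: repeatedly add the element with the largest
# marginal gain. This achieves a (1 - 1/e)-approximation of the optimum.
def greedy_submodular(elements, f, k):
    selected = set()
    for _ in range(k):
        gains = {e: f(selected | {e}) - f(selected) for e in elements if e not in selected}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        selected.add(best)
    return selected

# Toy monotone submodular objective: weighted coverage of a ground set
# (a stand-in for the feature-cross utility in the paper above).
rng = np.random.default_rng(3)
weights = rng.random(30)
covers = {e: rng.choice(30, size=5, replace=False) for e in range(10)}

def coverage(selected):
    covered = set()
    for e in selected:
        covered.update(covers[e].tolist())
    return weights[list(covered)].sum() if covered else 0.0

print(sorted(greedy_submodular(range(10), coverage, k=4)))
```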
- Linear-Sample Learning of Low-Rank Distributions [56.59844655107251]
We show that learning $k\times k$, rank-$r$ matrices to normalized $L_1$ distance requires $\Omega(\frac{kr}{\epsilon^2})$ samples.
We propose an algorithm that uses $\mathcal{O}(\frac{kr}{\epsilon^2})$ samples up to a squared logarithmic factor, a number linear in the high dimension and nearly linear in the typically low rank.
arXiv Detail & Related papers (2020-09-30T19:10:32Z)
- Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably be learned by gradient descent beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z)
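A bare-bones gradient-descent loop for an over-parametrized two-layer ReLU network on a toy regression task, illustrating the setting of the previous entry; none of the widths, step sizes or the synthetic target come from that paper.

```python
import numpy as np

# Two-layer ReLU network f(x) = a^T relu(W x) trained by full-batch gradient
# descent on the (halved) mean squared loss. Width, step size and the synthetic
# target are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(4)
d, width, n = 10, 256, 200

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])                                        # toy target

W = rng.standard_normal((width, d)) / np.sqrt(d)           # first-layer weights
a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)   # second-layer weights
lr = 0.02

for _ in range(1500):
    pre = X @ W.T                                          # (n, width) pre-activations
    h = np.maximum(pre, 0.0)                               # ReLU
    resid = h @ a - y
    grad_a = h.T @ resid / n                               # gradient w.r.t. a
    grad_W = ((resid[:, None] * (pre > 0)) * a).T @ X / n  # gradient w.r.t. W
    a -= lr * grad_a
    W -= lr * grad_W

pred = np.maximum(X @ W.T, 0.0) @ a
print("final mse:", float(np.mean((pred - y) ** 2)))
```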
- Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model [0.0]
We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-squares risk under this model.
As a special case, we analyze an online algorithm for estimating a real function on the unit interval from the noiseless observation of its value at randomly sampled points.
arXiv Detail & Related papers (2020-06-15T08:25:50Z)
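A small sketch of the previous entry's setting: single-pass, fixed step-size stochastic gradient descent on a noiseless linear model, estimating a function on the unit interval from exact values at randomly sampled points; the Fourier feature map, the ground-truth coefficients and the step size are assumptions made for illustration.

```python
import numpy as np

# Single-pass SGD with a fixed step size on a noiseless linear model
# y = <theta_star, phi(x)>: each sample is drawn once, observed exactly, and
# used for one gradient step. The Fourier feature map phi, theta_star and the
# step size are illustrative assumptions.
rng = np.random.default_rng(5)
num_features = 50
k = np.arange(1, num_features + 1)

def phi(x):
    return np.sin(np.pi * k * x) / k                  # smooth features on [0, 1]

theta_star = rng.standard_normal(num_features) / k    # ground-truth coefficients
theta = np.zeros(num_features)
step_size = 0.5                                        # fixed step size

for _ in range(20000):                                 # single pass over 20000 random points
    x = rng.random()
    feats = phi(x)
    error = feats @ theta - feats @ theta_star         # noiseless observation
    theta -= step_size * error * feats                 # SGD step on the squared loss

grid = np.linspace(0.0, 1.0, 5)
est = np.array([phi(x) @ theta for x in grid])
truth = np.array([phi(x) @ theta_star for x in grid])
print(np.round(est - truth, 4))
```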
- Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing a $n\times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y)=-\log\langle\varphi(x),\varphi(y)\rangle$, where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r\ll n$.
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
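A compact sketch of the idea in the previous entry (not the authors' code): with an entry-wise positive feature map $\varphi$, the Gibbs kernel factorises as $K = \Phi_x \Phi_y^\top$ with $r \ll n$, so each Sinkhorn update costs $O(nr)$ rather than $O(n^2)$; the particular random feature map below is a placeholder choice.

```python
import numpy as np

# Sinkhorn iterations with a low-rank, entrywise-positive kernel K = Px @ Py.T,
# where Px[i] = phi(x_i) and Py[j] = phi(y_j) live in the positive orthant R^r_+.
# Every update applies K or K.T through the two factors, i.e. O(n*r) work per
# iteration instead of O(n^2). The feature map below is only a placeholder.
rng = np.random.default_rng(6)
n, r = 500, 20

def positive_features(Z, Omega):
    """Placeholder positive feature map: exponentials of random projections."""
    F = np.exp(Z @ Omega - 0.5 * np.linalg.norm(Z, axis=1, keepdims=True) ** 2)
    return F / np.sqrt(Omega.shape[1])

X = rng.standard_normal((n, 2))
Y = rng.standard_normal((n, 2)) + 1.0
Omega = rng.standard_normal((2, r))
Px, Py = positive_features(X, Omega), positive_features(Y, Omega)

a = np.full(n, 1.0 / n)                  # source marginal
b = np.full(n, 1.0 / n)                  # target marginal
u, v = np.ones(n), np.ones(n)
for _ in range(200):
    u = a / (Px @ (Py.T @ v))            # apply K v without forming the n x n kernel
    v = b / (Py @ (Px.T @ u))            # apply K.T u likewise

# The transport plan is P = diag(u) K diag(v); check its row marginals match a.
row_marginals = u * (Px @ (Py.T @ v))
print(np.abs(row_marginals - a).max())
```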
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.