Overparametrized linear dimensionality reductions: From projection
pursuit to two-layer neural networks
- URL: http://arxiv.org/abs/2206.06526v1
- Date: Tue, 14 Jun 2022 00:07:33 GMT
- Title: Overparametrized linear dimensionality reductions: From projection
pursuit to two-layer neural networks
- Authors: Andrea Montanari and Kangjie Zhou
- Abstract summary: Given a cloud of $n$ data points in $mathbbRd$, consider all projections onto $m$-dimensional subspaces of $mathbbRd$.
What does this collection of probability distributions look like when $n,d$ grow large?
Denoting by $mathscrF_m, alpha$ the set of probability distributions in $mathbbRm$ that arise as low-dimensional projections in this limit, we establish new inner and outer bounds on $mathscrF_
- Score: 10.368585938419619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections
onto $m$-dimensional subspaces of $\mathbb{R}^d$ and, for each such projection,
the empirical distribution of the projected points. What does this collection
of probability distributions look like when $n,d$ grow large?
We consider this question under the null model in which the points are i.i.d.
standard Gaussian vectors, focusing on the asymptotic regime in which
$n,d\to\infty$, with $n/d\to\alpha\in (0,\infty)$, while $m$ is fixed. Denoting
by $\mathscr{F}_{m, \alpha}$ the set of probability distributions in
$\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we
establish new inner and outer bounds on $\mathscr{F}_{m, \alpha}$. In
particular, we characterize the Wasserstein radius of $\mathscr{F}_{m,\alpha}$
up to logarithmic factors, and determine it exactly for $m=1$. We also prove
sharp bounds in terms of Kullback-Leibler divergence and R\'{e}nyi information
dimension.
The previous question has application to unsupervised learning methods, such
as projection pursuit and independent component analysis. We introduce a
version of the same problem that is relevant for supervised learning, and prove
a sharp Wasserstein radius bound. As an application, we establish an upper
bound on the interpolation threshold of two-layers neural networks with $m$
hidden neurons.
Related papers
- Conditional regression for the Nonlinear Single-Variable Model [4.565636963872865]
We consider a model $F(X):=f(Pi_gamma):mathbbRdto[0,rmlen_gamma]$ where $Pi_gamma: [0,rmlen_gamma]tomathbbRd$ and $f:[0,rmlen_gamma]tomathbbR1$.
We propose a nonparametric estimator, based on conditional regression, and show that it can achieve the $one$-dimensional optimal min-max rate
arXiv Detail & Related papers (2024-11-14T18:53:51Z) - Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $mathRd suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z) - Which exceptional low-dimensional projections of a Gaussian point cloud can be found in polynomial time? [8.74634652691576]
We study the subset $mathscrF_m,alpha$ of distributions that can be realized by a class of iterative algorithms.
Non-rigorous methods from statistical physics yield an indirect characterization of $mathscrF_m,alpha$ in terms of a generalized Parisi formula.
arXiv Detail & Related papers (2024-06-05T05:54:56Z) - Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Debiasing and a local analysis for population clustering using
semidefinite programming [1.9761774213809036]
We consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions.
This work is motivated by the application of clustering individuals according to their population of origin.
arXiv Detail & Related papers (2024-01-16T03:14:24Z) - Estimation and Inference in Distributional Reinforcement Learning [28.253677740976197]
We show that a dataset of size $widetilde Oleft(frac|mathcalS||mathcalA|epsilon2 (1-gamma)4right)$ suffices to ensure the Kolmogorov metric and total variation metric between $hatetapi$ and $etapi$ is below $epsilon$ with high probability.
Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $etapi$.
arXiv Detail & Related papers (2023-09-29T14:14:53Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $mathbfx*$ rather than for all $mathbfx*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Learning (Very) Simple Generative Models Is Hard [45.13248517769758]
We show that no-time algorithm can solve problem even when output coordinates of $mathbbRdtobbRd'$ are one-hidden-layer ReLU networks with $mathrmpoly(d)$ neurons.
Key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with neurally-bounded slopes such that the pushforward of $mathcalN(0,1)$ under $f$ matches all low-degree moments of $mathcal
arXiv Detail & Related papers (2022-05-31T17:59:09Z) - Optimal Spectral Recovery of a Planted Vector in a Subspace [80.02218763267992]
We study efficient estimation and detection of a planted vector $v$ whose $ell_4$ norm differs from that of a Gaussian vector with the same $ell$ norm.
We show that in the regime $n rho gg sqrtN$, any spectral method from a large class (and more generally, any low-degree of the input) fails to detect the planted vector.
arXiv Detail & Related papers (2021-05-31T16:10:49Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamic of descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn with gradient loss at most ground with Tangent samples.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.