Universality of empirical risk minimization
- URL: http://arxiv.org/abs/2202.08832v1
- Date: Thu, 17 Feb 2022 18:53:45 GMT
- Title: Universality of empirical risk minimization
- Authors: Andrea Montanari and Basil Saeed
- Abstract summary: Consider supervised learning from i.i.d. samples where ${\boldsymbol x}_i \in \mathbb{R}^p$ are feature vectors and $y \in \mathbb{R}$ are labels.
We study universality of empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors.
- Score: 12.764655736673749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consider supervised learning from i.i.d. samples $\{{\boldsymbol
x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature
vectors and ${y} \in \mathbb{R}$ are labels. We study empirical risk
minimization over a class of functions that are parameterized by $\mathsf{k} =
O(1)$ vectors ${\boldsymbol \theta}_1, \ldots, {\boldsymbol \theta}_{\mathsf k}
\in \mathbb{R}^p$, and prove universality results both for the training and
test error. Namely, under the proportional asymptotics $n,p\to\infty$, with
$n/p = \Theta(1)$, we prove that the training error depends on the random
features distribution only through its covariance structure. Further, we prove
that the minimum test error over near-empirical risk minimizers enjoys similar
universality properties. In particular, the asymptotics of these quantities can
be computed, to leading order, under a simpler model in which the feature
vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol
g}_i$ with the same covariance. Earlier universality results were limited to
strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$
with independent entries. Our results do not make any of these assumptions. Our
assumptions are general enough to include feature vectors ${\boldsymbol x}_i$
that are produced by randomized featurization maps. In particular we explicitly
check the assumptions for certain random features models (computing the output
of a one-layer neural network with random weights) and neural tangent models
(first-order Taylor approximation of two-layer networks).
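The universality statement lends itself to a quick numerical check: run the same empirical risk minimization once on random-features vectors and once on Gaussian vectors with matched mean and covariance, and compare the training errors. The sketch below is illustrative only and is not the paper's construction; the tanh activation, the ridge-regularized square loss, the label rule, and all dimensions are assumptions chosen for the demonstration.

```python
# Minimal numerical sketch of the universality statement (illustrative, not the
# paper's construction): fit the same ridge-regularized ERM on (i) random-features
# vectors x_i = tanh(W z_i) and (ii) Gaussian vectors g_i with matched mean and
# covariance, then compare training errors. Activation, loss, label rule, and all
# dimensions below are assumptions chosen for this demonstration.
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 2000, 300, 400          # samples, input dim, feature dim (proportional regime)
lam = 0.1                         # ridge penalty

# Random-features map: output of a one-layer network with random weights W.
W = rng.normal(size=(p, d)) / np.sqrt(d)
Z = rng.normal(size=(n, d))
X = np.tanh(Z @ W.T)

# Labels depend on the features through a fixed linear projection plus noise
# (an illustrative choice, not the paper's general label model).
theta = rng.normal(size=p) / np.sqrt(p)
y = np.sign(X @ theta + 0.1 * rng.normal(size=n))

# Gaussian surrogate: same mean and covariance as the random features.
mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(p))
G = mu + rng.normal(size=(n, p)) @ L.T
y_g = np.sign(G @ theta + 0.1 * rng.normal(size=n))

def train_error(A, b):
    """Minimized ridge-regularized empirical square loss (the training error)."""
    w = np.linalg.solve(A.T @ A / len(b) + lam * np.eye(A.shape[1]), A.T @ b / len(b))
    return np.mean((A @ w - b) ** 2) + lam * np.dot(w, w)

print("training error, random features   :", train_error(X, y))
print("training error, Gaussian surrogate :", train_error(G, y_g))
```

In the proportional regime ($n, p$ large with $n/p$ of order one), the two printed training errors should agree to leading order, which is the content of the training-error universality in this stylized instance.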
Related papers
- Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on the number of heads $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
- A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z)
- Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
The ground-truth predictor is additive: $\mathbb{E}[\mathbf{z} \mid \mathbf{x}, \mathbf{y}] = f_\star(\mathbf{x}) + g_\star(\mathbf{y})$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z)
- Sparse Signal Detection in Heteroscedastic Gaussian Sequence Models: Sharp Minimax Rates [1.0309387309011746]
We study the signal detection problem against sparse alternatives, for known sparsity $s$.
We find upper and lower bounds on the minimax separation radius $\epsilon^*$ and prove that they always match.
Our results reveal new phase transitions regarding the behavior of $\epsilon^*$ with respect to the level of sparsity, to the $L_t$ metric, and to the heteroscedasticity profile of $\Sigma$.
arXiv Detail & Related papers (2022-11-15T23:53:39Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma$.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- Structure Learning in Graphical Models from Indirect Observations [17.521712510832558]
This paper considers learning of the graphical structure of a $p$-dimensional random vector $X \in \mathbb{R}^p$ using both parametric and non-parametric methods.
Under mild conditions, we show that our graph-structure estimator can obtain the correct structure.
arXiv Detail & Related papers (2022-05-06T19:24:44Z)
- Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Features (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z)
- Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration [19.78800773518545]
We study the use of random features methods in conjunction with ridge regression in the feature space $\mathbb{R}^N$; a minimal numerical sketch of this setup appears after this list.
This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime.
arXiv Detail & Related papers (2021-01-26T06:46:41Z)
- Near-Optimal Model Discrimination with Non-Disclosure [19.88145627448243]
We first consider the case of a well-specified linear model with squared loss.
We derive a sample complexity bound of a similar form, even under misspecification.
arXiv Detail & Related papers (2020-12-04T23:52:54Z)
- Tree-Projected Gradient Descent for Estimating Gradient-Sparse Parameters on Graphs [10.846572437131872]
We study estimation of a gradient-sparse parameter vector ${\boldsymbol \theta}^* \in \mathbb{R}^p$.
We show that, under suitable restricted strong convexity and smoothness assumptions for the loss, the resulting estimator achieves the squared-error risk $\frac{s^*}{n} \log(1+\frac{p}{s^*})$ up to a multiplicative constant that is independent of $G$.
arXiv Detail & Related papers (2020-05-31T20:08:13Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss (see the gradient-descent sketch after this list).
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
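The entry "Generalization error of random features and kernel methods" above treats random-features ridge regression as a finite-dimensional approximation of kernel ridge regression (KRR); the minimal sketch referenced there follows. It uses random Fourier features for the Gaussian RBF kernel as a concrete stand-in for that random-features model, and the data, bandwidth $\gamma$, feature count $N$, and ridge level are assumptions made for illustration.

```python
# Minimal sketch of random-features ridge regression as a finite-dimensional
# approximation of kernel ridge regression (KRR). Random Fourier features for the
# Gaussian RBF kernel are used as a stand-in for the random-features models above;
# data model, bandwidth, and N are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, d, N, lam, gamma = 500, 10, 2000, 1e-2, 0.5   # samples, dim, features, ridge, kernel width

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)    # illustrative target

# Exact KRR with the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
krr_fit = K @ alpha

# Random Fourier features: z(x) = sqrt(2/N) cos(W x + b) with W ~ N(0, 2*gamma*I).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(N, d))
b = rng.uniform(0, 2 * np.pi, size=N)
Z = np.sqrt(2.0 / N) * np.cos(X @ W.T + b)

# Ridge regression in the feature space R^N.
w = np.linalg.solve(Z.T @ Z + n * lam * np.eye(N), Z.T @ y)
rf_fit = Z @ w

print("max |KRR fit - RF ridge fit| on training points:", np.abs(krr_fit - rf_fit).max())
```

As $N$ grows, the random-features predictor approaches the KRR predictor, which is the sense in which it is a finite-dimensional approximation.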
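The two single-neuron entries ("Learning a Single Neuron with Adversarial Label Noise via Gradient Descent" and "Agnostic Learning of a Single Neuron with Gradient Descent") both analyze gradient descent on the square loss of $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$. The sketch below only illustrates that algorithm with a ReLU activation on synthetic noisy data; the step size, initialization, and data model are assumptions, and the papers' $\mathsf{OPT}$-dependent guarantees are not reproduced.

```python
# Minimal sketch of the algorithm analyzed in the two single-neuron entries:
# gradient descent on the empirical square loss of x -> sigma(w . x) with a ReLU
# activation, on synthetic noisy data. Step size, initialization, and the data
# model are assumptions; the papers' OPT-dependent guarantees are not reproduced.
import numpy as np

rng = np.random.default_rng(2)
n, d, lr, steps = 2000, 20, 0.1, 500

w_star = rng.normal(size=d) / np.sqrt(d)                      # target neuron
X = rng.normal(size=(n, d))
y = np.maximum(X @ w_star, 0.0) + 0.1 * rng.normal(size=n)    # ReLU response + label noise

def loss_and_grad(w):
    """Empirical square loss of the ReLU neuron and its (sub)gradient in w."""
    pre = X @ w
    resid = np.maximum(pre, 0.0) - y
    grad = X.T @ (resid * (pre > 0)) / n                      # chain rule through the ReLU
    return 0.5 * np.mean(resid ** 2), grad

w = 0.01 * rng.normal(size=d)                                 # small random initialization
for _ in range(steps):
    _, grad = loss_and_grad(w)
    w -= lr * grad

print("final square loss :", loss_and_grad(w)[0])
print("||w - w_star||    :", np.linalg.norm(w - w_star))
```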
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.