A Universal Law of Robustness via Isoperimetry
- URL: http://arxiv.org/abs/2105.12806v1
- Date: Wed, 26 May 2021 19:49:47 GMT
- Title: A Universal Law of Robustness via Isoperimetry
- Authors: S\'ebastien Bubeck, Mark Sellke
- Abstract summary: We show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension.
We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights.
- Score: 1.484852576248587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classically, data interpolation with a parametrized model class is possible
as long as the number of parameters is larger than the number of equations to
be satisfied. A puzzling phenomenon in deep learning is that models are trained
with many more parameters than what this classical theory would suggest. We
propose a theoretical explanation for this phenomenon. We prove that for a
broad class of data distributions and model classes, overparametrization is
necessary if one wants to interpolate the data smoothly. Namely we show that
smooth interpolation requires $d$ times more parameters than mere
interpolation, where $d$ is the ambient data dimension. We prove this universal
law of robustness for any smoothly parametrized function class with polynomial
size weights, and any covariate distribution verifying isoperimetry. In the
case of two-layers neural networks and Gaussian covariates, this law was
conjectured in prior work by Bubeck, Li and Nagaraj.
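As a quick way to parse the headline claim, here is an informal restatement in symbols (constants, logarithmic factors, and the precise isoperimetry and weight-size conditions are omitted; $n$ is the number of samples, $d$ the ambient dimension, $p$ the number of parameters):

```latex
% Informal restatement of the law of robustness (sketch only; constants,
% log factors, and the distributional/weight assumptions are suppressed).
\begin{align*}
  \text{mere interpolation:}   &\quad \text{generically possible once } p \gtrsim n,\\
  \text{smooth interpolation:} &\quad \mathrm{Lip}(f) \;\gtrsim\; \sqrt{nd/p}
     \ \text{ for any interpolating } f \text{ in the class},\\
  \text{hence } \mathrm{Lip}(f) = O(1) &\quad \text{forces } p \gtrsim nd.
\end{align*}
```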
Related papers
- Highly Adaptive Ridge [84.38107748875144]
We propose a regression method that achieves an $n^{-2/3}$ dimension-free L2 convergence rate in the class of right-continuous functions with square-integrable sectional derivatives.
HAR is exactly kernel ridge regression with a specific data-adaptive kernel based on a saturated zero-order tensor-product spline basis expansion.
We demonstrate empirical performance better than state-of-the-art algorithms for small datasets in particular.
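Since the summary characterizes HAR as kernel ridge regression with a particular data-adaptive kernel, a minimal kernel ridge regression sketch may be useful for orientation; the RBF kernel below is only a generic stand-in and is not HAR's saturated zero-order tensor-product spline kernel.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Generic stand-in kernel; HAR instead uses a data-adaptive spline-based kernel.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def kernel_ridge_fit_predict(X, y, X_test, lam=1e-2):
    # Kernel ridge regression: alpha = (K + lam * n * I)^{-1} y, f(x) = k(x, X) @ alpha.
    n = len(X)
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return rbf_kernel(X_test, X) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)
X_test = rng.uniform(-1, 1, size=(10, 2))
print(kernel_ridge_fit_predict(X, y, X_test))
```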
arXiv Detail & Related papers (2024-10-03T17:06:06Z) - Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size and $N$ is the data size.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
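Reading the rate above with $M$ as model size and $N$ as data size (a reconstruction of the paper's notation, so treat the labels as an assumption), balancing the two error terms gives a simple rule of thumb for how data should scale with parameters when $a > 1$:

```latex
% Equating the parameter-limited and data-limited terms of the test error.
\begin{equation*}
  M^{-(a-1)} \;=\; N^{-(a-1)/a}
  \quad\Longleftrightarrow\quad
  N \;\asymp\; M^{a}.
\end{equation*}
```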
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Implicit Manifold Gaussian Process Regression [49.0787777751317]
Gaussian process regression is widely used to provide well-calibrated uncertainty estimates.
It struggles with high-dimensional data, which in practice often lies on an implicit low-dimensional manifold.
In this paper we propose a technique capable of inferring implicit structure directly from data (labeled and unlabeled) in a fully differentiable way.
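For context on the baseline being extended, here is a minimal numpy-only sketch of exact Gaussian process regression with an RBF kernel, returning the calibrated posterior mean and variance; the manifold-inference machinery that is the paper's actual contribution is not reproduced here.

```python
import numpy as np

def rbf(A, B, lengthscale=0.5, variance=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, y, X_star, noise=1e-2):
    # Exact GP regression: posterior mean and pointwise variance at X_star.
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X_star, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(rbf(X_star, X_star)) - (v ** 2).sum(axis=0)
    return mean, var

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(30, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(30)
X_star = np.linspace(-2, 2, 5)[:, None]
print(gp_posterior(X, y, X_star))
```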
arXiv Detail & Related papers (2023-10-30T09:52:48Z) - Distribution learning via neural differential equations: a nonparametric
statistical perspective [1.4436965372953483]
This work establishes the first general statistical convergence analysis for distribution learning via ODE models trained through likelihood transformations.
We show that the latter can be quantified via the $C^1$-metric entropy of the class $\mathcal{F}$.
We then apply this general framework to the setting of $C^k$-smooth target densities, and establish nearly minimax-optimal convergence rates for two relevant velocity field classes $\mathcal{F}$: $C^k$ functions and neural networks.
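For readers unfamiliar with likelihood training of ODE models, the standard continuous-flow objective is the change-of-variables log-likelihood below; this is stated as general background and is not necessarily the paper's exact formulation.

```latex
% Log-likelihood of a sample x under an ODE flow z'(t) = f_\theta(z(t), t),
% with base density p_0 and terminal condition z(1) = x.
\begin{equation*}
  \log p_\theta(x)
  \;=\;
  \log p_0\bigl(z(0)\bigr)
  \;-\;
  \int_0^1 \operatorname{tr}\!\left(
     \frac{\partial f_\theta}{\partial z}\bigl(z(t), t\bigr)
  \right) dt .
\end{equation*}
```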
arXiv Detail & Related papers (2023-09-03T00:21:37Z) - Kernel-based off-policy estimation without overlap: Instance optimality
beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound of $\Omega(\sqrt{n/p})$ for neural networks with $p$ parameters that interpolate $n$ samples from an arbitrary distribution.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
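Set beside the headline paper above, the two lower bounds differ by exactly the dimension factor that isoperimetry buys; the comparison below is informal, with constants and log factors omitted ($n$ samples, $d$ dimensions, $p$ parameters).

```latex
% Informal comparison of Lipschitz lower bounds for interpolating models.
\begin{align*}
  \text{isoperimetric covariates (law of robustness above):}
    &\quad \mathrm{Lip}(f) \;\gtrsim\; \sqrt{nd/p},\\
  \text{arbitrary distributions (this paper):}
    &\quad \mathrm{Lip}(f) \;\gtrsim\; \sqrt{n/p}.
\end{align*}
```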
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Post-mortem on a deep learning contest: a Simpson's paradox and the
complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: "scale" metrics perform well overall but poorly on sub-partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
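A toy numerical illustration of this kind of Simpson's paradox, with invented numbers: a metric that correlates strongly with accuracy over the pooled corpus yet negatively within each sub-partition.

```python
import numpy as np

# Hypothetical "scale" metric values and test accuracies for two sub-partitions;
# the numbers are made up purely for illustration.
metric_a = np.array([1.0, 2.0, 3.0])
acc_a = np.array([0.30, 0.29, 0.28])
metric_b = np.array([11.0, 12.0, 13.0])
acc_b = np.array([0.80, 0.79, 0.78])

corr = lambda x, y: np.corrcoef(x, y)[0, 1]

print("within partition A:", corr(metric_a, acc_a))  # -1.0: metric misleads here
print("within partition B:", corr(metric_b, acc_b))  # -1.0: metric misleads here
print("pooled corpus:     ",
      corr(np.concatenate([metric_a, metric_b]),
           np.concatenate([acc_a, acc_b])))          # ~ +0.99: looks great overall
```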
arXiv Detail & Related papers (2021-06-01T19:19:49Z) - Fitting Elephants [0.0]
Modern machine learning (ML) approaches, cf. deep nets (DNNs), generalize well despite interpolating noisy data.
This may be understood via statistically consistent interpolation (SCI).
SCI shows that a purely empirical approach can successfully predict.
However, it does not by itself provide theoretical insights, and the training data requirements may be prohibitive.
arXiv Detail & Related papers (2021-03-31T05:50:39Z) - Fitting very flexible models: Linear regression with large numbers of
parameters [0.0]
Linear fitting is used to generalize and denoise data.
We discuss how this basis-function fitting is done, with ordinary least squares and extensions thereof.
It is even possible to take the limit of infinite parameters, at which, if the basis and regularization are chosen correctly, the least-squares fit becomes the mean of a Gaussian process.
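A minimal sketch of the basis-function fitting being discussed, with many more parameters than data points, using ordinary least squares (the minimum-norm solution) and its ridge-regularized extension; the Fourier basis and the regularization strength are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 400                      # deliberately far more parameters than data
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Design matrix for an illustrative Fourier feature basis (p columns total).
k = np.arange(1, p // 2 + 1)
Phi = np.concatenate([np.cos(2 * np.pi * np.outer(x, k)),
                      np.sin(2 * np.pi * np.outer(x, k))], axis=1)

# Ordinary least squares: lstsq returns the minimum-norm solution since p > n.
w_ols, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Ridge-regularized extension of the same fit.
lam = 1e-3
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

print("train residual (OLS):  ", np.linalg.norm(Phi @ w_ols - y))
print("train residual (ridge):", np.linalg.norm(Phi @ w_ridge - y))
```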
arXiv Detail & Related papers (2021-01-15T21:08:34Z) - Sparse Gaussian Processes with Spherical Harmonic Features [14.72311048788194]
We introduce a new class of inter-domain variational Gaussian processes (GPs).
Our inference scheme is comparable to variational Fourier features, but it does not suffer from the curse of dimensionality.
Our experiments show that our model is able to fit a regression model for a dataset with 6 million entries two orders of magnitude faster.
arXiv Detail & Related papers (2020-06-30T10:19:32Z)