A law of robustness for two-layers neural networks
- URL: http://arxiv.org/abs/2009.14444v2
- Date: Tue, 24 Nov 2020 23:59:09 GMT
- Title: A law of robustness for two-layers neural networks
- Authors: Sébastien Bubeck, Yuanzhi Li, and Dheeraj Nagaraj
- Abstract summary: We make a conjecture that, for any Lipschitz activation function and for most datasets, any two-layers neural network with $k$ neurons that perfectly fits the data must have its Lipschitz constant larger (up to a constant) than $\sqrt{n/k}$, where $n$ is the number of datapoints.
We prove this conjecture when the Lipschitz constant is replaced by an upper bound on it based on the spectral norm of the weight matrix.
- Score: 35.996863024271974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We initiate the study of the inherent tradeoffs between the size of a neural
network and its robustness, as measured by its Lipschitz constant. We make a
precise conjecture that, for any Lipschitz activation function and for most
datasets, any two-layers neural network with $k$ neurons that perfectly fit the
data must have its Lipschitz constant larger (up to a constant) than
$\sqrt{n/k}$ where $n$ is the number of datapoints. In particular, this
conjecture implies that overparametrization is necessary for robustness, since
it means that one needs roughly one neuron per datapoint to ensure a
$O(1)$-Lipschitz network, while mere data fitting of $d$-dimensional data
requires only one neuron per $d$ datapoints. We prove a weaker version of this
conjecture when the Lipschitz constant is replaced by an upper bound on it
based on the spectral norm of the weight matrix. We also prove the conjecture
in the high-dimensional regime $n \approx d$ (which we also refer to as the
undercomplete case, since only $k \leq d$ is relevant here). Finally we prove
the conjecture for polynomial activation functions of degree $p$ when $n
\approx d^p$. We complement these findings with experimental evidence
supporting the conjecture.
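For intuition only, here is a small numerical sketch (not the authors' experiment) of the quantities in the conjecture. It fits random $\pm 1$ labels on $n$ random inputs with a two-layer ReLU network in the random-features regime, i.e. a random fixed first layer $W$ and output weights $a$ obtained by least squares, and compares the standard Lipschitz upper bound $\|a\|_2\,\|W\|_{\mathrm{op}}$ for $f(x)=a^{\top}\mathrm{ReLU}(Wx)$ against $\sqrt{n/k}$. The dimensions, the random-features restriction, and this particular choice of spectral-norm bound are assumptions made for the illustration and need not match the paper's exact setup.

```python
import numpy as np

# Minimal sketch (not the paper's experiment): fit n random labels with a
# two-layer ReLU network in the random-features regime (first layer W fixed
# at random, output weights a solved by least squares), then compare a
# spectral-norm-based upper bound on the Lipschitz constant with sqrt(n/k).

rng = np.random.default_rng(0)
d, n = 50, 200                                    # input dimension, datapoints
X = rng.standard_normal((n, d)) / np.sqrt(d)      # inputs of roughly unit norm
y = rng.choice([-1.0, 1.0], size=n)               # generic (random) labels

for k in [200, 400, 800, 1600]:                   # number of neurons
    W = rng.standard_normal((k, d)) / np.sqrt(d)  # fixed first-layer weights
    H = np.maximum(X @ W.T, 0.0)                  # ReLU features, shape (n, k)
    a, *_ = np.linalg.lstsq(H, y, rcond=None)     # output weights fitting the data

    # For f(x) = a^T relu(W x), |f(x) - f(x')| <= ||a||_2 ||W||_op ||x - x'||_2
    # (ReLU is 1-Lipschitz), so this product upper-bounds the Lipschitz constant.
    lip_upper = np.linalg.norm(a) * np.linalg.norm(W, 2)
    fit_err = np.max(np.abs(H @ a - y))           # should be ~0 (interpolation)
    print(f"k={k:5d}  Lip upper bound={lip_upper:8.2f}  "
          f"sqrt(n/k)={np.sqrt(n / k):5.2f}  max fit error={fit_err:.1e}")
```

Since the random-features model is only a special case of the two-layer networks covered by the conjecture, the printed numbers are indicative of the scaling rather than a test of the theorem.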
Related papers
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ for an arbitrary link function, with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d$ ...
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
- Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z)
- Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
- Near-optimal fitting of ellipsoids to random points [68.12685213894112]
A basic problem of fitting an ellipsoid to random points has connections to low-rank matrix decompositions, independent component analysis, and principal component analysis.
We resolve this conjecture up to logarithmic factors by constructing a fitting ellipsoid for some $n = \Omega(\,d^2/\mathrm{polylog}(d)\,)$.
Our proof demonstrates feasibility of the least squares construction of Saunderson et al. using a convenient decomposition of a certain non-standard random matrix.
arXiv Detail & Related papers (2022-08-19T18:00:34Z)
- On the Multidimensional Random Subset Sum Problem [0.9007371440329465]
In the Random Subset Sum Problem, given $n$ i.i.d. random variables $X_1, \dots, X_n$, we wish to approximate any point $z \in [-1,1]$ as the sum of a subset $X_{i_1(z)}, \dots, X_{i_s(z)}$ of them, up to error $\varepsilon$.
We prove that, in $d$ dimensions, $n = O(d^3 \log \frac{1}{\varepsilon} \cdot \dots)$ ...
arXiv Detail & Related papers (2022-07-28T08:10:43Z)
- A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove an $\Omega(\sqrt{n/p})$ lower bound on the Lipschitz constant of neural networks with $p$ parameters that interpolate the data, on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z)
- An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks [40.489350374378645]
We prove that $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(\frac{d}{\delta}+n)$ weights are sufficient.
We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
arXiv Detail & Related papers (2021-06-14T19:42:32Z)
- Fundamental tradeoffs between memorization and robustness in random features and neural tangent regimes [15.76663241036412]
We prove, for a large class of activation functions, that if the model memorizes even a fraction of the training data, then its Sobolev seminorm is lower-bounded.
Experiments reveal, for the first time, a multiple-descent phenomenon in the robustness of the min-norm interpolator.
arXiv Detail & Related papers (2021-06-04T17:52:50Z)
- A Law of Robustness for Weight-bounded Neural Networks [37.54604146791085]
Bubeck et al. (2020) recently conjectured that when using two-layer networks with $k$ neurons to fit a generic dataset, the smallest Lipschitz constant is $\Omega(\sqrt{\frac{n}{k}})$.
In this work we derive a lower bound on the Lipschitz constant for any arbitrary model class with bounded Rademacher complexity.
Our result coincides with that conjectured in (Bubeck et al., 2020) for two-layer networks under the assumption of bounded weights.
arXiv Detail & Related papers (2021-02-16T11:28:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.