Fundamental tradeoffs between memorization and robustness in random
features and neural tangent regimes
- URL: http://arxiv.org/abs/2106.02630v1
- Date: Fri, 4 Jun 2021 17:52:50 GMT
- Title: Fundamental tradeoffs between memorization and robustness in random
features and neural tangent regimes
- Authors: Elvis Dohmatob
- Abstract summary: We prove for a large class of activation functions that, if the model memorizes even a fraction of the training data, then its Sobolev-seminorm is lower-bounded.
Experiments reveal, for the first time, (iv) a multiple-descent phenomenon in the robustness of the min-norm interpolator.
- Score: 15.76663241036412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the (non)robustness of two-layer neural networks in various
high-dimensional linearized regimes. We establish fundamental trade-offs
between memorization and robustness, as measured by the Sobolev-seminorm of the
model w.r.t. the data distribution, i.e. the square root of the average squared
$L_2$-norm of the gradient of the model w.r.t. its input. More precisely,
if $n$ is the number of training examples, $d$ is the input dimension, and $k$
is the number of hidden neurons in a two-layer neural network, we prove for a
large class of activation functions that, if the model memorizes even a
fraction of the training data, then its Sobolev-seminorm is lower-bounded by (i)
$\sqrt{n}$ in case of infinite-width random features (RF) or neural tangent
kernel (NTK) with $d \gtrsim n$; (ii) $\sqrt{n}$ in case of finite-width RF
with proportionate scaling of $d$ and $k$; and (iii) $\sqrt{n/k}$ in case of
finite-width NTK with proportionate scaling of $d$ and $k$. Moreover, all of
these lower-bounds are tight: they are attained by the min-norm / least-squares
interpolator (when $n$, $d$, and $k$ are in the appropriate interpolating
regime). All our results hold as soon as data is log-concave isotropic, and
there is label-noise, i.e. the target variable is not a deterministic function
of the data / features. We empirically validate our theoretical results with
experiments. Incidentally, these experiments also reveal, for the first time,
(iv) a multiple-descent phenomenon in the robustness of the min-norm
interpolator.
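To make the central quantity concrete, here is a minimal numerical sketch (not the paper's construction): it fits a min-norm random-features interpolator to pure-noise labels and Monte-Carlo estimates the Sobolev seminorm, i.e. the square root of the average squared $L_2$-norm of the model's input gradient. The ReLU activation, Gaussian data, the $1/\sqrt{d}$ feature scaling, and the helper name sobolev_seminorm_of_min_norm_rf are illustrative assumptions, not taken from the paper.

```python
# Sketch only: estimate the Sobolev seminorm of a min-norm random-features
# interpolator fit to pure label noise, and compare it against sqrt(n).
import numpy as np

rng = np.random.default_rng(0)

def sobolev_seminorm_of_min_norm_rf(n, d, k, n_test=2000):
    # Random features: phi(x) = relu(W x / sqrt(d)), W with i.i.d. N(0,1) entries.
    W = rng.standard_normal((k, d))
    X = rng.standard_normal((n, d))           # isotropic Gaussian data (log-concave)
    y = rng.standard_normal(n)                # pure label noise: y independent of X

    Phi = np.maximum(W @ X.T / np.sqrt(d), 0.0).T   # (n, k) feature matrix
    a = np.linalg.lstsq(Phi, y, rcond=None)[0]      # min-norm least-squares interpolator

    # Monte-Carlo estimate of E_x ||grad_x f(x)||^2 for f(x) = relu(W x / sqrt(d)) @ a.
    Xt = rng.standard_normal((n_test, d))
    act_grad = (W @ Xt.T / np.sqrt(d) > 0).astype(float)   # relu'(.) per neuron, (k, n_test)
    grads = ((a[:, None] * act_grad).T @ W) / np.sqrt(d)   # (n_test, d) input gradients
    return np.sqrt(np.mean(np.sum(grads**2, axis=1)))

for n in [100, 200, 400, 800]:
    d, k = 2 * n, 4 * n    # proportionate scaling; k > n so interpolation is possible
    s = sobolev_seminorm_of_min_norm_rf(n, d, k)
    print(f"n={n:4d}  seminorm={s:8.2f}  seminorm/sqrt(n)={s / np.sqrt(n):6.3f}")
```

Printing the ratio seminorm/sqrt(n) gives a rough empirical check of how the seminorm of the min-norm interpolator grows as it memorizes more noisy labels.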
Related papers
- Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - SGD Finds then Tunes Features in Two-Layer Neural Networks with
near-Optimal Sample Complexity: A Case Study in the XOR problem [1.3597551064547502]
We consider the optimization process of minibatch stochastic gradient descent (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function.
We prove that with data drawn from the $d$-dimensional Boolean hypercube labeled by the quadratic "XOR" function $y = -x_i x_j$, it is possible to train to a population error $o(1)$ with $d\,\text{polylog}(d)$ samples.
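A toy sketch of this setting (illustrative only: the width, learning rate, step count, and the choice to train only the first layer are assumptions, not the paper's algorithm or analysis):

```python
# Sketch: minibatch SGD on a two-layer ReLU network with Boolean-hypercube
# inputs labeled by the quadratic "XOR" function y = -x_1 * x_2.
import numpy as np

rng = np.random.default_rng(0)
d, width, lr, batch, steps = 32, 256, 0.05, 64, 5000

W = rng.standard_normal((width, d)) / np.sqrt(d)   # first layer (trained)
a = rng.choice([-1.0, 1.0], size=width) / width    # second layer (kept fixed here)

def sample(n):
    X = rng.choice([-1.0, 1.0], size=(n, d))
    return X, -X[:, 0] * X[:, 1]

for _ in range(steps):
    X, y = sample(batch)
    pre = X @ W.T                                  # (batch, width) pre-activations
    pred = np.maximum(pre, 0.0) @ a
    err = pred - y                                 # squared-loss residual
    # Gradient of 0.5 * mean((pred - y)^2) w.r.t. W.
    grad_W = ((err[:, None] * a) * (pre > 0)).T @ X / batch
    W -= lr * grad_W

X_test, y_test = sample(10000)
test_err = np.mean(np.sign(np.maximum(X_test @ W.T, 0.0) @ a) != np.sign(y_test))
print(f"test classification error: {test_err:.3f}")   # hyperparameters may need tuning
```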
arXiv Detail & Related papers (2023-09-26T17:57:44Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notation -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Correlation Functions in Random Fully Connected Neural Networks at
Finite Width [17.51364577113718]
This article considers fully connected neural networks with Gaussian random weights and biases and $L$ hidden layers.
For bounded non-linearities we give sharp recursion estimates in powers of $1/n$ for the joint correlation functions of the network output and its derivatives.
We find in both cases that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations.
arXiv Detail & Related papers (2022-04-03T11:57:18Z) - The Rate of Convergence of Variation-Constrained Deep Neural Networks [35.393855471751756]
We show that a class of variation-constrained neural networks can achieve near-parametric rate $n^{-1/2+\delta}$ for an arbitrarily small constant $\delta$.
The result indicates that the neural function space needed for approximating smooth functions may not be as large as what is often perceived.
arXiv Detail & Related papers (2021-06-22T21:28:00Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study memorization and generalization under lazy training in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z) - A Neural Scaling Law from the Dimension of the Data Manifold [8.656787568717252]
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$.
The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$.
This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses.
arXiv Detail & Related papers (2020-04-22T19:16:06Z)
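As a small illustration of the scaling-law bookkeeping in this last entry (the losses below are synthetic and purely illustrative, not the paper's measurements): generate losses following $L \propto N^{-\alpha}$ plus noise and recover the exponent by a log-log least-squares fit, comparing it to the predicted $4/d$.

```python
# Sketch: recover a power-law scaling exponent from synthetic (N, loss) pairs.
import numpy as np

rng = np.random.default_rng(0)
d_intrinsic = 8                                   # assumed intrinsic data dimension
alpha_true = 4.0 / d_intrinsic                    # predicted exponent, alpha ~ 4/d

N = np.logspace(4, 7, 10)                         # synthetic network sizes
loss = 3.0 * N ** (-alpha_true) * np.exp(0.02 * rng.standard_normal(N.size))

# Fit log(loss) = log(c) - alpha * log(N) by least squares.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(f"fitted alpha = {-slope:.3f}, predicted 4/d = {alpha_true:.3f}")
```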