Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios
- URL: http://arxiv.org/abs/2106.08619v1
- Date: Wed, 16 Jun 2021 08:27:31 GMT
- Title: Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios
- Authors: Alessandro Favero, Francesco Cagnetta, Matthieu Wyart
- Abstract summary: We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
- Score: 69.2027612631023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks perform a local and translationally-invariant
treatment of the data: quantifying which of these two aspects is central to
their success remains a challenge. We study this problem within a
teacher-student framework for kernel regression, using `convolutional' kernels
inspired by the neural tangent kernel of simple convolutional architectures of
given filter size. Using heuristic methods from physics, we find in the
ridgeless case that locality is key in determining the learning curve exponent
$\beta$ (that relates the test error $\epsilon_t\sim P^{-\beta}$ to the size of
the training set $P$), whereas translational invariance is not. In particular,
if the filter size of the teacher $t$ is smaller than that of the student $s$,
$\beta$ is a function of $s$ only and does not depend on the input dimension.
We confirm our predictions on $\beta$ empirically. Theoretically, in some cases
(including when teacher and student are equal) it can be shown that this
prediction is an upper bound on performance. We conclude by proving, using a
natural universality assumption, that performing kernel regression with a ridge
that decreases with the size of the training set leads to similar learning
curve exponents to those we obtain in the ridgeless case.
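A minimal numerical sketch of the measurement described above, under simplifying assumptions that are not the paper's: isotropic Laplace kernels on low-dimensional Gaussian inputs stand in for the convolutional, NTK-inspired kernels of given filter size, the teacher target is drawn as a Gaussian-process sample with the teacher kernel, and the ridge is taken to shrink with $P$ in the spirit of the abstract's final result. The helper names (laplace_kernel, teacher_sample, test_error) are hypothetical; the point is only the procedure of fitting kernel regression at several training-set sizes $P$ and reading $\beta$ off the slope of $\log\epsilon_t$ versus $\log P$.
```python
# Sketch only: estimate a learning-curve exponent beta for teacher-student
# kernel regression, assuming eps_t ~ P^{-beta}. Isotropic Laplace kernels
# replace the paper's convolutional kernels; this illustrates the measurement,
# not the paper's construction.
import numpy as np

rng = np.random.default_rng(0)
d = 3                                   # illustrative input dimension

def laplace_kernel(X, Y, scale=1.0):
    """K(x, y) = exp(-||x - y|| / scale)."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dists / scale)

def teacher_sample(X):
    """Draw the target as a Gaussian-process sample with the (here identical) teacher kernel."""
    K = laplace_kernel(X, X) + 1e-10 * np.eye(len(X))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(X))

def test_error(P, P_test=512):
    """Kernel ridge regression on P points with a ridge that shrinks with P; return test MSE."""
    ridge = 1e-3 / P                    # decreasing ridge, echoing the abstract's last result
    X = rng.standard_normal((P + P_test, d))
    y = teacher_sample(X)
    K_tr = laplace_kernel(X[:P], X[:P])
    alpha = np.linalg.solve(K_tr + ridge * np.eye(P), y[:P])
    y_pred = laplace_kernel(X[P:], X[:P]) @ alpha
    return np.mean((y[P:] - y_pred) ** 2)

Ps = np.array([64, 128, 256, 512, 1024])
eps = np.array([np.mean([test_error(P) for _ in range(5)]) for P in Ps])
beta = -np.polyfit(np.log(Ps), np.log(eps), 1)[0]       # slope of the log-log learning curve
print(f"estimated beta ~ {beta:.2f}")
```
With these stand-in kernels the fitted exponent will not match the paper's predictions for convolutional teachers and students of filter sizes $t$ and $s$; reproducing those would require the paper's own kernel construction.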
Related papers
- Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notation -- the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - Spatially heterogeneous learning by a deep student machine [0.0]
Deep neural networks (DNN) with a huge number of adjustable parameters remain largely black boxes.
We study supervised learning by a DNN of width $N$ and depth $L$, consisting of $NL$ perceptrons with $c$ inputs each, using a statistical mechanics approach called the teacher-student setting.
We show that the problem becomes exactly solvable in what we call the 'dense limit': $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha = M/c$, using the replica method developed in (H. Yoshino, ...).
arXiv Detail & Related papers (2023-02-15T01:09:03Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets, on the order of $P^* \sim \sqrt{N}$, for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - Failure and success of the spectral bias prediction for Kernel Ridge
Regression: the case of low-dimensional data [0.28647133890966986]
In some regimes, existing theories of kernel ridge regression predict that the method has a `spectral bias': decomposing the true function $f^*$ on the eigenbasis of the kernel, it fits the components along the largest-eigenvalue directions first (a short numerical sketch of this decomposition follows the list below).
This prediction works very well on benchmark data sets such as images, yet the assumptions these approaches make on the data are never satisfied in practice.
arXiv Detail & Related papers (2022-02-07T16:48:14Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Fundamental tradeoffs between memorization and robustness in random
features and neural tangent regimes [15.76663241036412]
We prove, for a large class of activation functions, that if the model memorizes even a fraction of the training set, then its Sobolev seminorm is lower-bounded.
Experiments also reveal, for the first time, a multiple-descent phenomenon in the robustness of the min-norm interpolator.
arXiv Detail & Related papers (2021-06-04T17:52:50Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z) - Optimization and Generalization of Shallow Neural Networks with
Quadratic Activation Functions [11.70706646606773]
We study the dynamics of optimization and the generalization properties of neural networks with one hidden layer.
We consider a teacher-student scenario where the teacher has the same structure as the student, with a hidden layer of smaller width.
We show that, under the same conditions, gradient descent dynamics on the empirical loss converges and leads to a small generalization error.
arXiv Detail & Related papers (2020-06-27T22:13:20Z)
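The `spectral bias' invoked in the kernel ridge regression entry above can be illustrated with a few lines of linear algebra. The sketch below is a hypothetical example, not code from any of the listed papers: it expands an arbitrary target $f^*$ in the eigenbasis of an RBF Gram matrix and applies the in-sample shrinkage factor $\lambda_i/(\lambda_i+\lambda)$ of kernel ridge regression with ridge $\lambda$, which is what it means for the method to fit the large-eigenvalue modes first.
```python
# Hypothetical illustration of spectral bias: decompose a target f* on the
# eigenbasis of a kernel Gram matrix and apply the per-mode shrinkage
# lambda_i / (lambda_i + ridge) of kernel ridge regression (its in-sample
# fit is K (K + ridge I)^{-1} y = U diag(lam / (lam + ridge)) U^T y).
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 2
X = rng.standard_normal((n, d))
f_star = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2             # illustrative target

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                                      # RBF Gram matrix
lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]                             # sort eigenvalues descending

coeffs = U.T @ f_star                                      # coordinates of f* in the kernel eigenbasis
ridge = 1e-2
shrink = lam / (lam + ridge)                               # how strongly KRR retains each mode

for k in (5, 20, 100):
    captured = np.sum(coeffs[:k] ** 2) / np.sum(coeffs ** 2)
    learned = np.sum((shrink[:k] * coeffs[:k]) ** 2) / np.sum(coeffs ** 2)
    print(f"top {k:3d} modes: share of f* = {captured:.2f}, retained by KRR = {learned:.2f}")
```
Large-eigenvalue modes have shrinkage factors near one and are fit almost exactly, while small-eigenvalue modes are suppressed; a larger effective ridge, such as the `self-induced' term mentioned above, suppresses correspondingly more modes.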