Intrinsic dimensionality and generalization properties of the
$\mathcal{R}$-norm inductive bias
- URL: http://arxiv.org/abs/2206.05317v2
- Date: Mon, 26 Jun 2023 12:00:22 GMT
- Title: Intrinsic dimensionality and generalization properties of the
$\mathcal{R}$-norm inductive bias
- Authors: Navid Ardeshir, Daniel Hsu, Clayton Sanford
- Abstract summary: The $\mathcal{R}$-norm is the basis of an inductive bias for two-layer neural networks.
We find that these interpolants are intrinsically multivariate functions, even when there are ridge functions that fit the data.
- Score: 4.37441734515066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the structural and statistical properties of $\mathcal{R}$-norm
minimizing interpolants of datasets labeled by specific target functions. The
$\mathcal{R}$-norm is the basis of an inductive bias for two-layer neural
networks, recently introduced to capture the functional effect of controlling
the size of network weights, independently of the network width. We find that
these interpolants are intrinsically multivariate functions, even when there
are ridge functions that fit the data, and also that the $\mathcal{R}$-norm
inductive bias is not sufficient for achieving statistically optimal
generalization for certain learning problems. Altogether, these results shed
new light on an inductive bias that is connected to practical neural network
training.
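Although the abstract is purely descriptive, the inductive bias it refers to can be made concrete with a small numerical sketch. The following assumes the standard correspondence (up to constants and unregularized bias/skip terms) between the $\mathcal{R}$-norm and the minimal squared-weight cost $\tfrac{1}{2}\sum_j (a_j^2 + \|w_j\|^2)$ over two-layer ReLU networks representing a function; the dataset (labeled by a ridge function, as in the paper's setting), the width, and all hyperparameters are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset labeled by a ridge function g(x) = relu(<u, x>).
d, n, width = 5, 40, 200
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
X = rng.standard_normal((n, d))
y = np.maximum(X @ u, 0.0)

# Two-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x> + b_j).
W = rng.standard_normal((width, d)) / np.sqrt(d)
b = np.zeros(width)
a = rng.standard_normal(width) / np.sqrt(width)

# Weight decay acts as a finite-width proxy for R-norm minimization:
# penalizing ||a||^2 + ||W||^2 over (near-)interpolating networks matches
# the R-norm-style weight cost up to rescaling.
lam, lr = 1e-4, 1e-2
for step in range(20000):
    pre = X @ W.T + b                       # (n, width) pre-activations
    act = np.maximum(pre, 0.0)              # ReLU
    resid = act @ a - y                     # (n,) residuals
    grad_a = act.T @ resid / n + lam * a
    g_pre = np.outer(resid, a) * (pre > 0)  # back-prop through the ReLU
    grad_W = g_pre.T @ X / n + lam * W
    grad_b = g_pre.sum(axis=0) / n + lam * b
    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b

mse = np.mean((np.maximum(X @ W.T + b, 0.0) @ a - y) ** 2)
cost = 0.5 * (np.sum(a ** 2) + np.sum(W ** 2))
print(f"train MSE = {mse:.5f}, weight cost (R-norm proxy) = {cost:.3f}")
```

Whether the resulting low-cost interpolant stays effectively one-dimensional off the training set is the kind of question the paper answers (negatively); the snippet only shows how the weight-cost surrogate is computed and minimized.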
Related papers
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, and exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
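As a rough illustration of this claim (ours, not the paper's experiment), the sketch below builds two classes with identical mean and covariance, so any classifier that uses only first- and second-order statistics is at chance; a small ReLU network trained with gradient descent must therefore pick up higher-order structure to exceed 50% test accuracy. The dataset, width, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes with identical mean (zero) and covariance (identity):
# class 0 ~ N(0, I_2); class 1 ~ uniform on the circle of radius sqrt(2).
def sample(n):
    x0 = rng.standard_normal((n, 2))
    theta = rng.uniform(0, 2 * np.pi, n)
    x1 = np.sqrt(2.0) * np.c_[np.cos(theta), np.sin(theta)]
    return np.vstack([x0, x1]), np.r_[np.zeros(n), np.ones(n)]

Xtr, ytr = sample(2000)
Xte, yte = sample(2000)

# Small two-layer ReLU classifier trained with full-batch gradient descent.
width = 64
W = 0.5 * rng.standard_normal((width, 2))
b = np.zeros(width)
a = 0.1 * rng.standard_normal(width)

lr = 0.1
for epoch in range(1, 2001):
    pre = Xtr @ W.T + b
    h = np.maximum(pre, 0.0)
    p = 1.0 / (1.0 + np.exp(-(h @ a)))       # sigmoid of the logits
    g = (p - ytr) / len(ytr)                 # gradient of the logistic loss w.r.t. logits
    grad_a = h.T @ g
    g_pre = np.outer(g, a) * (pre > 0)
    W -= lr * (g_pre.T @ Xtr)
    b -= lr * g_pre.sum(axis=0)
    a -= lr * grad_a
    if epoch in (50, 200, 500, 2000):
        acc = np.mean(((np.maximum(Xte @ W.T + b, 0.0) @ a) > 0) == yte)
        print(f"epoch {epoch:4d}: test accuracy {acc:.3f} (chance level is 0.5)")
```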
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- From Kernel Methods to Neural Networks: A Unifying Variational Formulation [25.6264886382888]
We present a unifying regularization functional that depends on an operator and on a generic Radon-domain norm.
Our framework offers guarantees of universal approximation for a broad family of regularization operators or, equivalently, for a wide variety of shallow neural networks.
arXiv Detail & Related papers (2022-06-29T13:13:53Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold under mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
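The notion of effective linear regions can be measured directly. The helper below (an illustration, not the paper's construction) evaluates the piecewise-constant derivative of a univariate ReLU network on a grid and counts the slope changes that exceed a tolerance; the random network, grid, and tolerance are our choices.

```python
import numpy as np

def effective_linear_regions(w, b, a, xs, tol=1e-6):
    """Count maximal intervals on which x -> sum_j a_j * relu(w_j * x + b_j)
    has (numerically) constant slope, evaluated on the grid xs."""
    active = (np.outer(xs, w) + b) > 0       # which neurons are active at each x
    slopes = active @ (a * w)                # piecewise-constant derivative
    changes = np.abs(np.diff(slopes)) > tol
    return 1 + int(np.sum(changes))

rng = np.random.default_rng(0)
width = 100
w = rng.standard_normal(width)
b = rng.standard_normal(width)
a = rng.standard_normal(width) / width

xs = np.linspace(-3, 3, 20001)
print("neurons:", width,
      "| effective linear regions on [-3, 3]:",
      effective_linear_regions(w, b, a, xs))
```

Applied to a network trained by gradient flow on labels produced by a teacher with $r$ neurons, the paper's result predicts a count on the order of $r$ rather than the student width; the function above is only the measuring device.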
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- Correlation Functions in Random Fully Connected Neural Networks at Finite Width [17.51364577113718]
This article considers fully connected neural networks with Gaussian random weights and biases and $L$ hidden layers.
For bounded non-linearities we give sharp recursion estimates in powers of $1/n$, where $n$ is the hidden-layer width, for the joint correlation functions of the network output and its derivatives.
In both this and the ReLU case, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations.
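A quick Monte Carlo sanity check of the depth-to-width effect (our own sketch, with arbitrary widths, depth, non-linearity, and no biases): sample many random tanh networks, evaluate them at a fixed input, and estimate the excess kurtosis of the output, a simple measure of non-Gaussian finite-width fluctuations. The estimate is noisy, so only coarse trends in $L/n$ are visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_net_output(x, n, L, trials):
    """Outputs of `trials` random fully connected tanh networks with L hidden
    layers of width n, Gaussian weights ~ N(0, 1/fan_in), biases omitted."""
    outs = np.empty(trials)
    for t in range(trials):
        h = x
        for _ in range(L):
            W = rng.standard_normal((n, h.size)) / np.sqrt(h.size)
            h = np.tanh(W @ h)
        v = rng.standard_normal(n) / np.sqrt(n)
        outs[t] = v @ h
    return outs

x = np.ones(10) / np.sqrt(10.0)              # fixed unit-norm input
for n, L in [(8, 4), (32, 4), (128, 4), (32, 16)]:
    z = random_net_output(x, n, L, trials=4000)
    m2 = np.mean(z ** 2)
    kurt = np.mean(z ** 4) / m2 ** 2 - 3.0   # excess kurtosis: 0 for a Gaussian
    print(f"width n={n:4d}, depth L={L:2d}: excess kurtosis ~ {kurt:+.3f}")
```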
arXiv Detail & Related papers (2022-04-03T11:57:18Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
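A minimal version of this setup (ours, not the paper's exact regime): overparameterized noisy linear data, a two-layer linear network $f(x) = v^\top W x$ trained with full-batch gradient descent from small initialization until it nearly interpolates, and the excess risk read off from the distance between the end-to-end linear map and the ground-truth signal. Dimensions, initialization scale, and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy linear data with more parameters than samples, so exact fitting is possible.
n, d, k, sigma = 50, 200, 100, 0.5
beta = np.zeros(d)
beta[:5] = 1.0                              # ground-truth signal
X = rng.standard_normal((n, d))
y = X @ beta + sigma * rng.standard_normal(n)

# Two-layer linear network f(x) = v^T W x, small initialization.
scale = 0.03
W = scale * rng.standard_normal((k, d))
v = scale * rng.standard_normal(k)

lr = 0.01
for step in range(20000):
    theta = W.T @ v                         # end-to-end linear predictor
    resid = X @ theta - y                   # (n,)
    grad_theta = X.T @ resid / n            # gradient w.r.t. the product map
    grad_W = np.outer(v, grad_theta)        # chain rule through the factorization
    grad_v = W @ grad_theta
    W -= lr * grad_W
    v -= lr * grad_v

theta = W.T @ v
train_mse = np.mean((X @ theta - y) ** 2)
excess_risk = np.sum((theta - beta) ** 2)   # excess risk for x ~ N(0, I_d)
print(f"train MSE = {train_mse:.4f}, excess risk = {excess_risk:.4f}")
```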
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
- The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can make two well-separated classes linearly separable with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
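The separation statement is easy to sanity-check numerically (a toy instance of ours, not the paper's construction): take two classes that are not linearly separable in input space, push them through one random ReLU layer with standard Gaussian weights and uniformly distributed biases, and certify linear separability of the features with a perceptron.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two concentric rings: not linearly separable in the 2-D input space.
def rings(n):
    theta = rng.uniform(0, 2 * np.pi, 2 * n)
    r = np.r_[np.full(n, 1.0), np.full(n, 2.0)] + 0.05 * rng.standard_normal(2 * n)
    X = np.c_[r * np.cos(theta), r * np.sin(theta)]
    y = np.r_[-np.ones(n), np.ones(n)]
    return X, y

X, y = rings(200)

# Random two-layer ReLU layer: Gaussian weights, uniformly distributed biases.
m = 500
W = rng.standard_normal((m, 2))
b = rng.uniform(-2.0, 2.0, m)
Phi = np.maximum(X @ W.T + b, 0.0)
Phi = np.c_[Phi, np.ones(len(X))]            # constant feature for the bias term

# Perceptron: reaches a pass with zero mistakes iff the feature vectors are
# linearly separable (within the iteration budget).
w = np.zeros(Phi.shape[1])
for epoch in range(500):
    mistakes = 0
    for i in range(len(X)):
        if y[i] * (Phi[i] @ w) <= 0:
            w += y[i] * Phi[i]
            mistakes += 1
    if mistakes == 0:
        break
print(f"perceptron mistakes in final pass: {mistakes} "
      f"(0 means the random features made the rings linearly separable)")
```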
arXiv Detail & Related papers (2021-07-31T10:25:26Z)
- Fundamental tradeoffs between memorization and robustness in random features and neural tangent regimes [15.76663241036412]
We prove for a large class of activation functions that, if the model memorizes even a fraction of the training data, then its Sobolev seminorm is lower-bounded.
Experiments reveal, for the first time, a multiple-descent phenomenon in the robustness of the min-norm interpolator.
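A small look at the memorization side of this tradeoff (our own random-features setup, not the paper's exact model): interpolate pure-noise labels with the min-norm solution over random ReLU features and estimate a Sobolev-seminorm proxy, the mean squared gradient norm of the fitted function, as the amount of memorized noise grows.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 20, 2000                              # input dimension, number of random features
W = rng.standard_normal((m, d))              # frozen first-layer weights

def features(X):
    return np.maximum(X @ W.T, 0.0) / np.sqrt(m)

def mean_sq_grad_norm(theta, Xs):
    """Estimate E ||grad f(x)||^2 for f(x) = <theta, features(x)> over samples Xs."""
    total = 0.0
    for x in Xs:
        active = (W @ x > 0).astype(float)
        grad = W.T @ (theta * active) / np.sqrt(m)
        total += grad @ grad
    return total / len(Xs)

Xtest = rng.standard_normal((200, d))
Xtest /= np.linalg.norm(Xtest, axis=1, keepdims=True)

for n in [50, 200, 800]:
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)               # pure-noise labels: fitting them is memorization
    theta = np.linalg.pinv(features(X)) @ y  # min-norm interpolator of the noise
    print(f"n = {n:4d} memorized labels -> Sobolev-seminorm proxy "
          f"E||grad f||^2 ~ {mean_sq_grad_norm(theta, Xtest):.3f}")
```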
arXiv Detail & Related papers (2021-06-04T17:52:50Z)
- The Efficacy of $L_1$ Regularization in Two-Layer Neural Networks [36.753907384994704]
A crucial problem in neural networks is to select the most appropriate number of hidden neurons and obtain tight statistical risk bounds.
We show that $L_1$ regularization can control the generalization error and sparsify the input dimension.
An excessively large number of neurons does not necessarily inflate the generalization error under a suitable regularization.
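A small proximal-gradient sketch of this effect (illustrative target, width, and penalty, not the paper's estimator): train a deliberately over-wide two-layer ReLU network with an $L_1$ penalty on the output weights via soft-thresholding and count how many hidden neurons remain active.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D regression target with two kinks; a handful of ReLU neurons suffices.
n = 200
x = np.linspace(-2, 2, n).reshape(-1, 1)
y = np.maximum(x[:, 0], 0.0) - 2.0 * np.maximum(x[:, 0] - 1.0, 0.0)

width = 300                                  # deliberately far too many neurons
W = rng.standard_normal((width, 1))
b = rng.uniform(-2, 2, width)
a = np.zeros(width)                          # output weights start at zero

lam, lr = 1e-3, 0.05
for step in range(20000):
    pre = x @ W.T + b
    act = np.maximum(pre, 0.0)
    resid = act @ a - y
    grad_a = act.T @ resid / n
    g_pre = np.outer(resid, a) * (pre > 0)
    W -= lr * (g_pre.T @ x / n)
    b -= lr * g_pre.sum(axis=0) / n
    # Proximal (soft-thresholding) step for the L1 penalty on the output weights.
    a = a - lr * grad_a
    a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)

active_neurons = int(np.sum(np.abs(a) > 1e-8))
print(f"train MSE = {np.mean((np.maximum(x @ W.T + b, 0) @ a - y) ** 2):.5f}, "
      f"active neurons = {active_neurons} / {width}")
```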
arXiv Detail & Related papers (2020-10-02T15:23:22Z)
- The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training [10.72393527290646]
We study memorization and generalization phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$ (with $N$ neurons, input dimension $d$, and $n$ samples), the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, where the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function.
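The approximation can be probed in a small instance (ours; the paper derives the precise size of the self-induced regularization, whereas here the ridge is just a free knob): ridgeless regression on $N$ random ReLU features with $Nd \gg n$, compared against kernel ridge regression with the infinite-width (arc-cosine) kernel and a small positive ridge.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(n, d):
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

d, n, N = 20, 100, 4000                      # chosen so that N*d >> n
Xtr, Xte = sphere(n, d), sphere(500, d)
beta = rng.standard_normal(d) / np.sqrt(d)
target = lambda X: np.maximum(X @ beta, 0.0)
ytr = target(Xtr) + 0.1 * rng.standard_normal(n)

# (1) Min-norm (ridgeless) regression on N random ReLU features.
W = rng.standard_normal((N, d))
phi = lambda X: np.maximum(X @ W.T, 0.0) / np.sqrt(N)
theta = np.linalg.pinv(phi(Xtr)) @ ytr
err_rf = np.mean((phi(Xte) @ theta - target(Xte)) ** 2)

# (2) Kernel ridge regression with the infinite-width (arc-cosine, ReLU) kernel.
def arccos_kernel(A, B):
    cos = np.clip(A @ B.T, -1.0, 1.0)        # inputs are unit-norm
    th = np.arccos(cos)
    return (np.sin(th) + (np.pi - th) * cos) / (2 * np.pi)

ridge = 1e-3                                  # stands in for the self-induced term
K = arccos_kernel(Xtr, Xtr)
alpha = np.linalg.solve(K + ridge * np.eye(n), ytr)
err_krr = np.mean((arccos_kernel(Xte, Xtr) @ alpha - target(Xte)) ** 2)

print(f"random-features test error (vs. noiseless target): {err_rf:.4f}")
print(f"kernel ridge test error (vs. noiseless target):    {err_krr:.4f}")
```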
arXiv Detail & Related papers (2020-07-25T01:51:13Z)
- Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using gradient descent.
For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
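To make the min-max formulation concrete in the simplest possible case (a deliberate simplification: linear players in place of the paper's neural-network parameterizations), the sketch below estimates a scalar instrumental-variable model by writing the moment equation $E[(Y - \theta X)Z] = 0$ as a regularized min-max game and solving it with simultaneous gradient descent ascent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar IV model: X is endogenous (correlated with the noise), Z is an instrument.
n, theta_true = 5000, 2.0
Z = rng.standard_normal(n)
confound = rng.standard_normal(n)
X = 0.8 * Z + confound
Y = theta_true * X + confound + 0.1 * rng.standard_normal(n)

# Min-max objective: min_theta max_omega  omega * E[(Y - theta X) Z] - omega^2 / 2.
# The inner maximization enforces the (linear operator) moment equation.
theta, omega, lr = 0.0, 0.0, 0.05
for step in range(5000):
    moment = np.mean((Y - theta * X) * Z)
    grad_theta = -omega * np.mean(X * Z)     # descent direction for the learner
    grad_omega = moment - omega              # ascent direction for the adversary
    theta -= lr * grad_theta
    omega += lr * grad_omega

print(f"GDA estimate: {theta:.3f}  "
      f"(IV closed form: {np.mean(Y * Z) / np.mean(X * Z):.3f}, true value: {theta_true})")
```

The quadratic penalty on the adversary makes the game strongly concave in the inner player, which is what lets plain simultaneous gradient descent ascent converge to the solution of the moment equation in this toy case.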
arXiv Detail & Related papers (2020-07-02T17:55:47Z)