The merged-staircase property: a necessary and nearly sufficient
condition for SGD learning of sparse functions on two-layer neural networks
- URL: http://arxiv.org/abs/2202.08658v1
- Date: Thu, 17 Feb 2022 13:43:06 GMT
- Authors: Emmanuel Abbe, Enric Boix-Adsera, Theodor Misiakiewicz
- Abstract summary: We study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension.
Our main results characterize a hierarchical property, the "merged-staircase property", that is both necessary and nearly sufficient for learning in this setting.
A key tool is a new "dimension-free" dynamics approximation that applies to functions defined on a latent low-dimensional subspace.
- Score: 24.428843425522103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is currently known how to characterize functions that neural networks can
learn with SGD for two extremal parameterizations: neural networks in the
linear regime, and neural networks with no structural constraints. However, for
the main parametrization of interest (non-linear but regular networks) no tight
characterization has yet been achieved, despite significant developments.
We take a step in this direction by considering depth-2 neural networks
trained by SGD in the mean-field regime. We consider functions on binary inputs
that depend on a latent low-dimensional subspace (i.e., small number of
coordinates). This regime is of interest since it is poorly understood how
neural networks routinely tackle high-dimensional datasets and adapt to latent
low-dimensional structure without suffering from the curse of dimensionality.
Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large
ambient dimension $d$.
Our main results characterize a hierarchical property, the "merged-staircase
property", that is both necessary and nearly sufficient for learning in this
setting.
We further show that non-linear training is necessary: for this class of
functions, linear methods on any feature map (e.g., the NTK) are not capable of
learning efficiently. The key tools are a new "dimension-free" dynamics
approximation result that applies to functions defined on a latent space of
low dimension, a proof of global convergence based on polynomial identity
testing, and an improvement of lower bounds against linear methods for
non-almost orthogonal functions.
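Roughly, the merged-staircase property asks that the target function's monomials (over the latent coordinates) can be ordered so that each monomial introduces at most one coordinate not already covered by the preceding ones. The brute-force checker below is an illustrative sketch of that reading of the property, not the paper's formal statement; the function name and examples are hypothetical.

```python
from itertools import permutations

def has_msp(monomials):
    """Return True if the monomial support sets can be ordered so that
    each set adds at most one new coordinate (merged-staircase property,
    informal reading). Brute force over orderings; fine for small inputs."""
    sets = [frozenset(s) for s in monomials]
    for order in permutations(sets):
        seen = set()
        if all(len(s - seen) <= 1 and not seen.update(s) for s in order):
            return True
    return False

# z1 + z1*z2 + z1*z2*z3: each monomial adds one new coordinate -> satisfies MSP
print(has_msp([{1}, {1, 2}, {1, 2, 3}]))  # True
# z1*z2*z3 alone introduces three coordinates at once -> fails MSP
print(has_msp([{1, 2, 3}]))               # False
```

Note the `seen.update(s)` inside `all(...)` returns `None` (falsy), so `not seen.update(s)` both accumulates coordinates and keeps the generator truthy; a plain loop with an explicit `break` would be an equally valid design.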
Related papers
- Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
We consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variant in this paper.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has the spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
arXiv Detail & Related papers (2024-07-16T03:38:34Z) - Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions [20.036783417617652]
We investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms.
We show that a simple modification of the idealized single-pass gradient descent training scenario drastically improves its computational efficiency.
Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing.
arXiv Detail & Related papers (2024-05-24T11:34:31Z) - Nonparametric regression using over-parameterized shallow ReLU neural networks [10.339057554827392]
We show that neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes.
It is assumed that the regression function is from the Hölder space with smoothness $\alpha < (d+3)/2$ or a variation space corresponding to shallow neural networks.
As a byproduct, we derive a new size-independent bound for the local Rademacher complexity of shallow ReLU neural networks.
arXiv Detail & Related papers (2023-06-14T07:42:37Z) - ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models [9.96121040675476]
This manuscript explores how properties of functions learned by neural networks of depth greater than two layers affect predictions.
Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs.
arXiv Detail & Related papers (2023-05-24T22:10:12Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linear separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z) - Deep Networks Provably Classify Data on Curves [12.309532551321334]
We study a model problem that uses a deep fully-connected neural network to classify data drawn from two disjoint smooth curves on the unit sphere.
We prove that when (i) the network depth is large relative to certain geometric properties that set the difficulty of the problem and (ii) the network width and number of samples are polynomial in the depth, randomly-initialized gradient descent quickly learns to correctly classify all points on the two curves with high probability.
arXiv Detail & Related papers (2021-07-29T20:40:04Z) - Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L^1$ and $L^2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)
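The claim above, that L1 and L2 penalties suppress the growth of model complexity, can be illustrated with a generic penalized least-squares loss. This is a minimal sketch of standard regularization, not the LANN complexity measure from that paper; all names and data here are hypothetical.

```python
import numpy as np

def penalized_loss(w, X, y, l1=0.0, l2=0.0):
    """Squared-error loss with optional L1 (lasso) and L2 (ridge) penalties.
    Generic illustration of how the penalties add to the data-fit term."""
    residual = X @ w - y
    return (0.5 * np.mean(residual ** 2)
            + l1 * np.sum(np.abs(w))      # L1 penalty: encourages sparsity
            + l2 * np.sum(w ** 2))        # L2 penalty: shrinks weight norms

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = X @ w_true

# At the same weights, turning on a penalty can only increase the loss,
# so minimizing it biases training toward smaller (simpler) weights.
print(penalized_loss(w_true, X, y))
print(penalized_loss(w_true, X, y, l1=0.1, l2=0.1))
```

The design point is that both penalties are additive terms on top of the data-fit loss, so the same gradient-descent loop applies; only the implicit bias on the learned weights changes.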
This list is automatically generated from the titles and abstracts of the papers in this site.