Learning Hierarchical Polynomials with Three-Layer Neural Networks
- URL: http://arxiv.org/abs/2311.13774v1
- Date: Thu, 23 Nov 2023 02:19:32 GMT
- Title: Learning Hierarchical Polynomials with Three-Layer Neural Networks
- Authors: Zihao Wang, Eshaan Nichani, Jason D. Lee
- Abstract summary: We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
- Score: 56.71223169861528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning hierarchical polynomials over the standard
Gaussian distribution with three-layer neural networks. We specifically
consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d
\rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R}
\rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class
generalizes the single-index model, which corresponds to $k=1$, and is a
natural class of functions possessing an underlying hierarchical structure. Our
main result shows that for a large subclass of degree $k$ polynomials $p$, a
three-layer neural network trained via layerwise gradient descent on the square
loss learns the target $h$ up to vanishing test error in
$\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time. This is a strict
improvement over kernel methods, which require $\widetilde \Theta(d^{kq})$
samples, as well as existing guarantees for two-layer networks, which require
the target function to be low-rank. Our result also generalizes prior works on
three-layer neural networks, which were restricted to the case of $p$ being a
quadratic. When $p$ is indeed a quadratic, we achieve the
information-theoretically optimal sample complexity
$\widetilde{\mathcal{O}}(d^2)$, which is an improvement over prior
work~\citep{nichani2023provable} requiring a sample size of
$\widetilde\Theta(d^4)$. Our proof proceeds by showing that during the initial
stage of training the network performs feature learning to recover the feature
$p$ with $\widetilde{\mathcal{O}}(d^k)$ samples. This work demonstrates the
ability of three-layer neural networks to learn complex features and as a
result, learn a broad class of hierarchical functions.
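As a concrete, purely illustrative sketch of this setting (not the paper's exact construction or algorithm), the code below builds a hierarchical target $h = g \circ p$ with $p$ a quadratic and $g$ a cubic link, and trains a three-layer ReLU network in two layerwise stages: the inner layers are trained first on the square loss (a stand-in for the feature-learning stage), then frozen while the outer layer is fit. The widths, step sizes, and the specific choices of $p$ and $g$ are assumptions made for illustration.
```python
# Hypothetical sketch of the h = g(p(x)) setting with layerwise training.
# Widths, step sizes, and the specific p, g below are illustrative choices,
# not the exact construction or algorithm analyzed in the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 32, 20_000

# Degree-2 feature p(x) = x^T A x and degree-3 link g(z) = z^3 - z.
A = torch.randn(d, d) / d
p = lambda x: ((x @ A) * x).sum(dim=1)
g = lambda z: z**3 - z

x = torch.randn(n, d)                      # standard Gaussian inputs
y = g(p(x))                                # hierarchical target h = g . p

# Three-layer network: x -> ReLU(V x) -> ReLU(W .) -> linear readout a.
V = nn.Linear(d, 256)
W = nn.Linear(256, 128)
a = nn.Linear(128, 1)
net = lambda x: a(torch.relu(W(torch.relu(V(x)))))
loss_fn = nn.MSELoss()

# Stage 1 (feature learning): train only the inner layers on the square loss.
opt1 = torch.optim.SGD(list(V.parameters()) + list(W.parameters()), lr=1e-3)
for _ in range(500):
    opt1.zero_grad()
    loss_fn(net(x).squeeze(), y).backward()
    opt1.step()

# Stage 2: freeze the learned features and fit only the outer layer.
opt2 = torch.optim.SGD(a.parameters(), lr=1e-2)
for _ in range(500):
    opt2.zero_grad()
    loss_fn(net(x).squeeze(), y).backward()
    opt2.step()

with torch.no_grad():
    x_test = torch.randn(5_000, d)
    print("test MSE:", loss_fn(net(x_test).squeeze(), g(p(x_test))).item())
```
In the paper's analysis, it is the initial feature-learning stage that recovers $p$ with $\widetilde{\mathcal{O}}(d^k)$ samples; the sketch only mirrors that two-stage structure.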
Related papers
- Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations [40.77319247558742]
We study the computational complexity of learning a target function $f_*: \mathbb{R}^d \to \mathbb{R}$ with additive structure.
We prove that a large subset of $f_*$ can be efficiently learned by gradient training of a two-layer neural network.
arXiv Detail & Related papers (2024-06-17T17:59:17Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with an arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d\,\mathrm{polylog}\,d$ (a hypothetical sketch of this single-index setting appears after this list).
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - Learning (Very) Simple Generative Models Is Hard [45.13248517769758]
We show that no polynomial-time algorithm can solve the learning problem even when the output coordinates of the map $\mathbb{R}^d \to \mathbb{R}^{d'}$ are one-hidden-layer ReLU networks with $\mathrm{poly}(d)$ neurons.
A key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with polynomially bounded slopes such that the pushforward of $\mathcal{N}(0,1)$ under $f$ matches all low-degree moments of $\mathcal{N}(0,1)$.
arXiv Detail & Related papers (2022-05-31T17:59:09Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network trained by gradient descent can provably learn the ground-truth network with small population loss, going beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Sharp Representation Theorems for ReLU Networks with Precise Dependence
on Depth [26.87238691716307]
We prove sharp, dimension-free representation results for neural networks with $D$ ReLU layers under the square loss.
Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions.
arXiv Detail & Related papers (2020-06-07T05:25:06Z) - On the Modularity of Hypernetworks [103.1147622394852]
We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
arXiv Detail & Related papers (2020-02-23T22:51:52Z) - A Corrective View of Neural Networks: Representation, Memorization and
Learning [26.87238691716307]
We develop a corrective mechanism for neural network approximation.
We show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels.
We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions.
arXiv Detail & Related papers (2020-02-01T20:51:09Z) - Backward Feature Correction: How Deep Learning Performs Deep
(Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)
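For contrast with the $k = 1$ (single-index) special case referenced above, the following is a minimal sketch, with an assumed link function, width, and optimizer settings, of fitting a two-layer network by SGD to $f_*(x) = \sigma_*(\langle x, \theta \rangle)$ over isotropic Gaussian inputs.
```python
# Hypothetical single-index (k = 1) example: f_*(x) = sigma_*(<x, theta>).
# The link sigma_*, width, and optimizer settings are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 32, 20_000
theta = torch.randn(d)
theta /= theta.norm()                       # unit-norm index direction
sigma_star = lambda z: z**2 + z             # example low-degree link

x = torch.randn(n, d)
y = sigma_star(x @ theta)

# Two-layer ReLU network trained end-to-end with SGD.
model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(2_000):
    opt.zero_grad()
    loss_fn(model(x).squeeze(), y).backward()
    opt.step()

with torch.no_grad():
    x_test = torch.randn(5_000, d)
    print("test MSE:", loss_fn(model(x_test).squeeze(),
                               sigma_star(x_test @ theta)).item())
```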
This list is automatically generated from the titles and abstracts of the papers on this site.