Neural Networks can Learn Representations with Gradient Descent
- URL: http://arxiv.org/abs/2206.15144v1
- Date: Thu, 30 Jun 2022 09:24:02 GMT
- Title: Neural Networks can Learn Representations with Gradient Descent
- Authors: Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi
- Abstract summary: In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
- Score: 68.95262816363288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant theoretical work has established that in specific regimes, neural
networks trained by gradient descent behave like kernel methods. However, in
practice, it is known that neural networks strongly outperform their associated
kernels. In this work, we explain this gap by demonstrating that there is a
large class of functions which cannot be efficiently learned by kernel methods
but can be easily learned with gradient descent on a two layer neural network
outside the kernel regime by learning representations that are relevant to the
target task. We also demonstrate that these representations allow for efficient
transfer learning, which is impossible in the kernel regime.
Specifically, we consider the problem of learning polynomials which depend on
only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U:
\R^d \to \R^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known
that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel
regime. Our primary result is that gradient descent learns a representation of
the data which depends only on the directions relevant to $f^\star$. This
results in an improved sample complexity of $n\asymp d^2 r + dr^p$.
Furthermore, in a transfer learning setup where the data distributions in the
source and target domain share the same representation $U$ but have different
polynomial heads we show that a popular heuristic for transfer learning has a
target sample complexity independent of $d$.
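To make the setting concrete, here is a minimal sketch (not the authors' code) of the data model $f^\star(x) = g(Ux)$ and of training a two-layer network on it with plain full-batch gradient descent on the square loss; the width, step size, number of steps, and the particular degree-3 choice of $g$ below are illustrative assumptions.
```python
# Minimal sketch of the paper's setting (illustrative only, not the authors' code).
# Target: f*(x) = g(Ux) with U: R^d -> R^r and d >> r; the network is a plain
# two-layer model trained by gradient descent on the square loss, with both
# layers free to move (i.e. outside the kernel/NTK regime).
import torch

torch.manual_seed(0)
d, r, n, width = 100, 3, 5000, 512          # d >> r; sizes are assumptions

# Row-orthonormal U spanning the r relevant directions.
U = torch.linalg.qr(torch.randn(d, r)).Q.T  # shape (r, d)

def g(z):
    # Assumed degree-3 link g: R^r -> R (any low-degree polynomial would do).
    return (z ** 3 - 3 * z).sum(dim=1) + z.prod(dim=1)

X = torch.randn(n, d)                       # isotropic Gaussian inputs
y = g(X @ U.T)                              # labels f*(x) = g(Ux)

model = torch.nn.Sequential(                # two-layer ReLU network
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(2000):                    # full-batch gradient descent
    opt.zero_grad()
    loss = ((model(X).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    opt.step()
```
In this sketch, the transfer-learning heuristic referenced in the abstract would plausibly correspond to freezing the trained first layer (the learned representation) and fitting only a new output head on target-domain data, which is the mechanism behind a target sample complexity independent of $d$.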
Related papers
- Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks [46.190882811878744]
In deep learning theory, a critical question is to understand how neural networks learn hierarchical features.
In this work, we study the learning of hierarchical polynomials of multiple nonlinear features using three-layer neural networks.
arXiv Detail & Related papers (2024-11-26T08:14:48Z) - Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations [40.77319247558742]
We study the computational complexity of learning a target function $f_*: \mathbb{R}^d \to \mathbb{R}$ with additive structure.
We prove that a large subset of such target functions $f_*$ can be efficiently learned by gradient-based training of a two-layer neural network.
arXiv Detail & Related papers (2024-06-17T17:59:17Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d$.
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer ReLU neural network trained by gradient descent can provably learn the ground-truth network, going beyond what is achievable in the Neural Tangent Kernel (NTK) regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)