Backward Feature Correction: How Deep Learning Performs Deep
(Hierarchical) Learning
- URL: http://arxiv.org/abs/2001.04413v6
- Date: Fri, 7 Jul 2023 06:12:32 GMT
- Title: Backward Feature Correction: How Deep Learning Performs Deep
(Hierarchical) Learning
- Authors: Zeyuan Allen-Zhu and Yuanzhi Li
- Abstract summary: This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
- Score: 66.05472746340142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning is also known as hierarchical learning, where the learner
_learns_ to represent a complicated target function by decomposing it into a
sequence of simpler functions to reduce sample and time complexity. This paper
formally analyzes how multi-layer neural networks can perform such hierarchical
learning _efficiently_ and _automatically_ by SGD on the training objective.
On the conceptual side, we present a theoretical characterization of how
certain types of deep (i.e. super-constant layer) neural networks can still be
sample and time efficiently trained on some hierarchical tasks, when no
existing algorithm (including layerwise training, kernel method, etc) is known
to be efficient. We establish a new principle called "backward feature
correction", where the errors in the lower-level features can be automatically
corrected when training together with the higher-level layers. We believe this
is a key behind how deep learning is performing deep (hierarchical) learning,
as opposed to layerwise learning or simulating some non-hierarchical method.
On the technical side, we show for every input dimension $d > 0$, there is a
concept class of degree $\omega(1)$ multi-variate polynomials so that, using
$\omega(1)$-layer neural networks as learners, SGD can learn any function from
this class in $\mathsf{poly}(d)$ time to any $\frac{1}{\mathsf{poly}(d)}$
error, through learning to represent it as a composition of $\omega(1)$ layers
of quadratic functions using "backward feature correction." In contrast, we do
not know any other simpler algorithm (including layerwise training, applying
kernel method sequentially, training a two-layer network, etc) that can learn
this concept class in $\mathsf{poly}(d)$ time even to any $d^{-0.01}$ error. As
a side result, we prove $d^{\omega(1)}$ lower bounds for several
non-hierarchical learners, including any kernel methods.
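To make the construction above concrete, here is a minimal PyTorch sketch of the idea (a toy illustration, not the paper's actual construction or proof setting): a hierarchical target built by composing quadratic maps is learned end-to-end by SGD with a network whose layers are themselves quadratic, so that gradients flowing back from higher layers keep adjusting the lower-level features ("backward feature correction"). All dimensions, widths, the specific target, and the learning rate are illustrative assumptions.

```python
# Toy sketch (not the paper's construction): a hierarchical target built by
# composing quadratic maps, learned end-to-end by SGD with a network whose
# layers are themselves quadratic. Sizes and learning rate are illustrative.
import torch

torch.manual_seed(0)
d, depth, n = 16, 3, 4096

# Hierarchical target: x -> q_depth(...q_1(x)...)[0], each q_i a random quadratic map.
target_mats = [torch.randn(d, d, d) / d for _ in range(depth)]
def target(x):
    for M in target_mats:
        x = torch.einsum('bi,ijk,bk->bj', x, M, x)    # coordinate-wise quadratics
        x = x / (x.norm(dim=1, keepdim=True) + 1e-6)  # keep scales bounded
    return x[:, 0]                                    # scalar label

class QuadLayer(torch.nn.Module):
    """A layer computing a learnable quadratic function of its input."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = torch.nn.Linear(d_in, d_out)
        self.quad = torch.nn.Parameter(torch.randn(d_out, d_in, d_in) * d_in ** -1.5)
    def forward(self, x):
        return self.lin(x) + torch.einsum('bi,oij,bj->bo', x, self.quad, x)

net = torch.nn.Sequential(*[QuadLayer(d, d) for _ in range(depth)], torch.nn.Linear(d, 1))

# Joint (end-to-end) training: gradients from the higher layers keep correcting
# the lower-level features -- the effect that purely layerwise training lacks.
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
for step in range(2000):
    x = torch.randn(n, d)
    loss = ((net(x).squeeze(-1) - target(x)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

A layerwise baseline can be obtained from the same sketch by passing only one layer's parameters to the optimizer at a time, bottom-up; the abstract's claim is that such a variant cannot reach comparable error on these compositional targets.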
Related papers
- Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks [46.190882811878744]
In deep learning theory, a critical question is to understand how neural networks learn hierarchical features.
In this work, we study the learning of hierarchical polynomials of _multiple_ nonlinear features using three-layer neural networks.
arXiv Detail & Related papers (2024-11-26T08:14:48Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks, for targets of the form $h = g \circ p$ where $p$ is a degree-$k$ polynomial.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
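As an illustration of the setup in this entry, the following is a minimal PyTorch sketch assuming a toy choice of $p$ and $g$ and a simple two-stage layerwise schedule; it is not the paper's exact algorithm or analysis, and all sizes are made up for illustration.

```python
# Illustrative sketch: a hierarchical target h = g(p(x)) over Gaussian inputs,
# fit by a three-layer ReLU network trained layerwise (first layer, then the
# rest). The choices of p, g, widths, and schedule are assumptions only.
import torch

torch.manual_seed(0)
d, width, n = 32, 256, 8192
A = torch.randn(d, d) / d
p = lambda x: (x @ A * x).sum(dim=1)     # a degree-2 "feature" polynomial
g = lambda z: z ** 2 - 1.0               # outer polynomial
h = lambda x: g(p(x))                    # hierarchical target

net = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(),
    torch.nn.Linear(width, width), torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)

def fit(params, steps, lr=1e-3):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        x = torch.randn(n, d)
        loss = ((net(x).squeeze(-1) - h(x)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

fit(net[0].parameters(), steps=1000)                                      # stage 1: first layer
loss = fit(list(net[2].parameters()) + list(net[4].parameters()), 1000)   # stage 2: upper layers
print("train loss after layerwise training:", loss)
```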
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
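The learning task in this entry is easy to write down; the sketch below generates the Gaussian data model $y = \sum_i w_i \,\mathrm{ReLU}(\langle v_i, x\rangle)$ and fits it with a plain gradient-based baseline. This is not the Schur-polynomial algorithm the paper analyzes, and the dimensions and step counts are illustrative assumptions.

```python
# Data model: a positive combination of k ReLU activations under standard
# Gaussian inputs, fit here by a small one-hidden-layer network with Adam
# (a baseline, not the paper's algorithm).
import torch

torch.manual_seed(0)
d, k, n = 20, 3, 8192
V = torch.randn(k, d); V = V / V.norm(dim=1, keepdim=True)  # ground-truth directions
w = torch.rand(k)                                           # combination weights

x = torch.randn(n, d)                                       # x ~ N(0, I_d)
y = (torch.relu(x @ V.T) * w).sum(dim=1)                    # y = sum_i w_i ReLU(<v_i, x>)

model = torch.nn.Sequential(torch.nn.Linear(d, k, bias=False), torch.nn.ReLU(),
                            torch.nn.Linear(k, 1, bias=False))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(2000):
    loss = ((model(x).squeeze(-1) - y) ** 2).mean()         # square loss
    opt.zero_grad(); loss.backward(); opt.step()

x_test = torch.randn(4096, d)
y_test = (torch.relu(x_test @ V.T) * w).sum(dim=1)
print("test square loss:", ((model(x_test).squeeze(-1) - y_test) ** 2).mean().item())
```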
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Understanding Deep Neural Function Approximation in Reinforcement
Learning via $\epsilon$-Greedy Exploration [53.90873926758026]
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL)
We focus on the value-based algorithm with $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed by Besov (and Barron) function spaces.
Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it into a generalization problem under the non-i.i.d. setting.
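For context on the algorithmic setting, here is a generic sketch of $\epsilon$-greedy action selection with a small neural Q-function; the Besov/Barron-space analysis itself is purely theoretical and is not reflected in the code. The state and action dimensions and the value of $\epsilon$ are illustrative assumptions.

```python
# Generic epsilon-greedy action selection with a two-layer neural Q-function.
import random
import torch

state_dim, n_actions, epsilon = 4, 2, 0.1

q_net = torch.nn.Sequential(               # two-layer network, as in the setting above
    torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, n_actions),
)

def epsilon_greedy_action(state: torch.Tensor) -> int:
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state).argmax().item())

action = epsilon_greedy_action(torch.zeros(state_dim))
```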
arXiv Detail & Related papers (2022-09-15T15:42:47Z) - Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new and alternative view of deep neural networks (DNNs) as a collection of search trees, where each training iteration corresponds to modifying a small subset of the nodes.
We believe this view would have further applications in the analysis of deep networks.
arXiv Detail & Related papers (2022-08-09T02:29:42Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
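A toy sketch of the over-parameterized tensor decomposition setting described above: a rank-$r$ symmetric third-order tensor is fit with $m \gg r$ components by gradient descent from small initialization. Sizes, initialization scale, and learning rate are illustrative assumptions, not the exact setting of the paper.

```python
# Over-parameterized tensor decomposition by gradient descent: ground truth is
# rank r, but we optimize m >> r components from a small random initialization.
import torch

torch.manual_seed(0)
d, r, m = 10, 2, 50
A = torch.randn(r, d) / d ** 0.5
T = torch.einsum('ri,rj,rk->ijk', A, A, A)        # T = sum_i a_i (x) a_i (x) a_i

U = torch.nn.Parameter(0.01 * torch.randn(m, d))  # over-parameterized factor, small init
opt = torch.optim.SGD([U], lr=0.05)
for step in range(5000):
    T_hat = torch.einsum('ri,rj,rk->ijk', U, U, U)
    loss = ((T_hat - T) ** 2).sum()                # squared Frobenius error
    opt.zero_grad(); loss.backward(); opt.step()

print("final Frobenius error:", loss.item() ** 0.5)
```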