Towards Understanding Hierarchical Learning: Benefits of Neural
Representations
- URL: http://arxiv.org/abs/2006.13436v2
- Date: Fri, 5 Mar 2021 15:46:56 GMT
- Title: Towards Understanding Hierarchical Learning: Benefits of Neural
Representations
- Authors: Minshuo Chen, Yu Bai, Jason D. Lee, Tuo Zhao, Huan Wang, Caiming
Xiong, Richard Socher
- Abstract summary: In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks.
We show that neural representation can achieve improved sample complexities compared with the raw input.
Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks can empirically perform efficient hierarchical learning,
in which the layers learn useful representations of the data. However, how they
make use of the intermediate representations is not explained by recent
theories that relate them to "shallow learners" such as kernels. In this work,
we demonstrate that intermediate neural representations add more flexibility to
neural networks and can be advantageous over raw inputs. We consider a fixed,
randomly initialized neural network as a representation function fed into
another trainable network. When the trainable network is the quadratic Taylor
model of a wide two-layer network, we show that neural representation can
achieve improved sample complexities compared with the raw input: For learning
a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimensions, neural
representation requires only $\tilde{O}(d^{\lceil p/2 \rceil})$ samples, while
the best-known sample complexity upper bound for the raw input is
$\tilde{O}(d^{p-1})$. We contrast our result with a lower bound showing that
neural representations do not improve over the raw input (in the infinite width
limit), when the trainable network is instead a neural tangent kernel. Our
results characterize when neural representations are beneficial, and may
provide a new perspective on why depth is important in deep learning.
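The setup described in the abstract can be illustrated with a minimal NumPy sketch: a fixed random layer produces a representation, and the trainable part is the second-order Taylor term of a wide two-layer network around its initialization. Everything concrete here is an assumption made for illustration and not the paper's exact construction: the tanh activation, the $1/\sqrt{m}$ scaling, the random sign output weights, the dimensions, and all function names are hypothetical. The helper `sample_exponents` only restates the arithmetic of the two bounds quoted in the abstract, ignoring logarithmic factors.

```python
import math
import numpy as np

def tanh_d1(z):
    """First derivative of tanh."""
    return 1.0 - np.tanh(z) ** 2

def tanh_d2(z):
    """Second derivative of tanh."""
    t = np.tanh(z)
    return -2.0 * t * (1.0 - t ** 2)

def random_representation(X, V, b):
    """Fixed, untrained feature map h(x) = tanh(V x + b) (assumed form)."""
    return np.tanh(X @ V.T + b)

def two_layer(H, W, a):
    """Two-layer network f_W(h) = (1/sqrt(m)) * sum_r a_r tanh(w_r . h)."""
    return (np.tanh(H @ W.T) @ a) / np.sqrt(W.shape[0])

def linear_term(H, W0, dW, a):
    """First-order Taylor term in the weight displacement dW around W0."""
    return ((tanh_d1(H @ W0.T) * (H @ dW.T)) @ a) / np.sqrt(W0.shape[0])

def quadratic_term(H, W0, dW, a):
    """Second-order Taylor term: the 'quadratic Taylor model' in dW."""
    m = W0.shape[0]
    return ((tanh_d2(H @ W0.T) * (H @ dW.T) ** 2) @ a) / (2.0 * np.sqrt(m))

def sample_exponents(p):
    """Exponents of d in the two bounds from the abstract (log factors dropped):
    ceil(p/2) with the neural representation, p - 1 with the raw input."""
    return math.ceil(p / 2), p - 1

# Hypothetical sizes: input dim d, representation dim D, width m, n inputs.
rng = np.random.default_rng(0)
d, D, m, n = 5, 16, 32, 8
X = rng.standard_normal((n, d)) / np.sqrt(d)
V = rng.standard_normal((D, d))          # fixed random first layer
b = rng.standard_normal(D)
H = random_representation(X, V, b)       # shape (n, D)

W0 = rng.standard_normal((m, D)) / np.sqrt(D)   # initialization
dW = rng.standard_normal((m, D)) / np.sqrt(D)   # trainable displacement
a = rng.choice([-1.0, 1.0], size=m)             # fixed output signs

def taylor_residual(t):
    """Max error of the second-order expansion at displacement t * dW."""
    exact = two_layer(H, W0 + t * dW, a)
    approx = (two_layer(H, W0, a)
              + t * linear_term(H, W0, dW, a)
              + t ** 2 * quadratic_term(H, W0, dW, a))
    return np.max(np.abs(exact - approx))

for p in (4, 5, 6):
    rep, raw = sample_exponents(p)
    print(f"p={p}: d^{rep} (representation) vs d^{raw} (raw input)")
print("Taylor residual at t=0.01:", taylor_residual(1e-2))
```

In this sketch the residual of the expansion should shrink roughly cubically in `t`, which is a quick sanity check that `quadratic_term` really is the second-order term. Replacing `H` with `X` (and resizing `W0` and `dW` accordingly) gives the raw-input variant that the abstract compares against.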
Related papers
- LinSATNet: The Positive Linear Satisfiability Neural Networks [116.65291739666303]
This paper studies how to introduce the popular positive linear satisfiability to neural networks.
We propose the first differentiable satisfiability layer based on an extension of the classic Sinkhorn algorithm for jointly encoding multiple sets of marginal distributions.
arXiv Detail & Related papers (2024-07-18T22:05:21Z)
- Generative Kaleidoscopic Networks [2.321684718906739]
We utilize this property of neural networks to design a dataset kaleidoscope, termed 'Generative Kaleidoscopic Networks'.
We observed this phenomenon to varying degrees in other deep learning architectures such as CNNs, Transformers, and U-Nets.
arXiv Detail & Related papers (2024-02-19T02:48:40Z)
- Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- Does Preprocessing Help Training Over-parameterized Neural Networks? [19.64638346701198]
We propose two novel preprocessing ideas to bypass the $\Omega(mnd)$ barrier.
Our results provide theoretical insights for a large number of previously established fast training methods.
arXiv Detail & Related papers (2021-10-09T18:16:23Z)
- The Rate of Convergence of Variation-Constrained Deep Neural Networks [35.393855471751756]
We show that a class of variation-constrained neural networks can achieve the near-parametric rate $n^{-1/2+\delta}$ for an arbitrarily small constant $\delta$.
The result indicates that the neural function space needed for approximating smooth functions may not be as large as what is often perceived.
arXiv Detail & Related papers (2021-06-22T21:28:00Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- On Tractable Representations of Binary Neural Networks [23.50970665150779]
We consider the compilation of a binary neural network's decision function into tractable representations such as Ordered Binary Decision Diagrams (OBDDs) and Sentential Decision Diagrams (SDDs).
In experiments, we show that it is feasible to obtain compact representations of neural networks as SDDs.
arXiv Detail & Related papers (2020-04-05T03:21:26Z)
- A Deep Conditioning Treatment of Neural Networks [37.192369308257504]
We show that depth improves trainability of neural networks by improving the conditioning of certain kernel matrices of the input data.
We provide versions of the result that hold for training just the top layer of the neural network, as well as for training all layers via the neural tangent kernel.
arXiv Detail & Related papers (2020-02-04T20:21:36Z)