The staircase property: How hierarchical structure can guide deep
learning
- URL: http://arxiv.org/abs/2108.10573v1
- Date: Tue, 24 Aug 2021 08:19:05 GMT
- Title: The staircase property: How hierarchical structure can guide deep
learning
- Authors: Emmanuel Abbe, Enric Boix-Adsera, Matthew Brennan, Guy Bresler,
Dheeraj Nagaraj
- Abstract summary: This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically.
We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks.
- Score: 38.713566366330326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper identifies a structural property of data distributions that
enables deep neural networks to learn hierarchically. We define the "staircase"
property for functions over the Boolean hypercube, which posits that high-order
Fourier coefficients are reachable from lower-order Fourier coefficients along
increasing chains. We prove that functions satisfying this property can be
learned in polynomial time using layerwise stochastic coordinate descent on
regular neural networks -- a class of network architectures and initializations
that have homogeneity properties. Our analysis shows that for such staircase
functions and neural networks, the gradient-based algorithm learns high-level
features by greedily combining lower-level features along the depth of the
network. We further back our theoretical results with experiments showing that
staircase functions are also learnable by more standard ResNet architectures
with stochastic gradient descent. Both the theoretical and experimental results
support the fact that staircase properties have a role to play in understanding
the capabilities of gradient-based learning on regular networks, in contrast to
general polynomial-size networks that can emulate any SQ or PAC algorithms as
recently shown.
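For intuition, here is a minimal sketch of the kind of function the abstract describes. It uses the standard illustrative staircase example S_d(x) = x_1 + x_1 x_2 + ... + x_1 x_2 ... x_d on the Boolean hypercube and checks by brute force that its nonzero Fourier coefficients form a single increasing chain; the variable names and the enumeration code are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch, assuming the canonical staircase example
#   S_d(x) = x1 + x1*x2 + ... + x1*x2*...*xd   over {-1,+1}^d.
# The brute-force Fourier computation below is illustrative, not the paper's code.
import itertools
import numpy as np

d = 4

def staircase(x):
    """Sum of monomials of increasing degree: x1, x1*x2, ..., x1*...*xd."""
    return sum(np.prod(x[:k + 1]) for k in range(d))

# Brute-force Fourier coefficients: f_hat(T) = E_x[ f(x) * prod_{i in T} x_i ].
cube = [np.array(x) for x in itertools.product([-1, 1], repeat=d)]
for r in range(1, d + 1):
    for T in itertools.combinations(range(d), r):
        coeff = np.mean([staircase(x) * np.prod(x[list(T)]) for x in cube])
        if abs(coeff) > 1e-9:
            print(T, round(coeff, 3))
# Output: only (0,), (0, 1), (0, 1, 2), (0, 1, 2, 3) survive -- each
# higher-order coefficient extends the previous support by one coordinate,
# i.e. the "increasing chain" structure the abstract refers to.
```

The chain structure is what makes layerwise, gradient-based learning plausible here: a learner can pick up the degree-1 term first and then extend it one coordinate at a time, rather than having to find a high-degree parity from scratch, which matches the abstract's description of greedily combining lower-level features along the depth of the network.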
Related papers
- Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions [20.036783417617652]
We investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms.
We show that a simple modification of the idealized single-pass gradient descent training scenario drastically improves its computational efficiency.
Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing.
arXiv Detail & Related papers (2024-05-24T11:34:31Z) - Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z) - Feature Network Methods in Machine Learning and Applications [0.0]
A machine learning (ML) feature network is a graph that connects ML features in learning tasks based on their similarity.
We provide an example of a deep tree-structured feature network, where hierarchical connections are formed through feature clustering and feed-forward learning.
arXiv Detail & Related papers (2024-01-10T01:57:12Z) - Data Topology-Dependent Upper Bounds of Neural Network Widths [52.58441144171022]
We first show that a three-layer neural network can be designed to approximate an indicator function over a compact set.
This is then extended to a simplicial complex, deriving width upper bounds based on its topological structure.
We prove the universal approximation property of three-layer ReLU networks using our topological approach.
arXiv Detail & Related papers (2023-05-25T14:17:15Z) - ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models [9.96121040675476]
This manuscript explores how properties of functions learned by neural networks of depth greater than two layers affect predictions.
Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs.
arXiv Detail & Related papers (2023-05-24T22:10:12Z) - Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural
Networks [49.808194368781095]
We show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks.
This work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.
arXiv Detail & Related papers (2023-05-11T17:19:30Z) - Rank Diminishing in Deep Neural Networks [71.03777954670323]
The rank of a neural network measures information flowing across layers.
It is an instance of a key structural condition that applies across broad domains of machine learning.
For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear.
arXiv Detail & Related papers (2022-06-13T12:03:32Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Towards Lower Bounds on the Depth of ReLU Neural Networks [7.355977594790584]
We investigate whether the class of exactly representable functions strictly increases by adding more layers.
We settle an old conjecture about piecewise linear functions by Wang and Sun (2005) in the affirmative.
We present upper bounds on the sizes of neural networks required to represent functions with logarithmic depth.
arXiv Detail & Related papers (2021-05-31T09:49:14Z) - Neural networks adapting to datasets: learning network size and topology [77.34726150561087]
We introduce a flexible setup that allows a neural network to learn both its size and topology during the course of gradient-based training.
The resulting network has the structure of a graph tailored to the particular learning task and dataset.
arXiv Detail & Related papers (2020-06-22T12:46:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.