The staircase property: How hierarchical structure can guide deep
learning
- URL: http://arxiv.org/abs/2108.10573v1
- Date: Tue, 24 Aug 2021 08:19:05 GMT
- Title: The staircase property: How hierarchical structure can guide deep
learning
- Authors: Emmanuel Abbe, Enric Boix-Adsera, Matthew Brennan, Guy Bresler,
Dheeraj Nagaraj
- Abstract summary: This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically.
We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks.
- Score: 38.713566366330326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper identifies a structural property of data distributions that
enables deep neural networks to learn hierarchically. We define the "staircase"
property for functions over the Boolean hypercube, which posits that high-order
Fourier coefficients are reachable from lower-order Fourier coefficients along
increasing chains. We prove that functions satisfying this property can be
learned in polynomial time using layerwise stochastic coordinate descent on
regular neural networks -- a class of network architectures and initializations
that have homogeneity properties. Our analysis shows that for such staircase
functions and neural networks, the gradient-based algorithm learns high-level
features by greedily combining lower-level features along the depth of the
network. We further back our theoretical results with experiments showing that
staircase functions are also learnable by more standard ResNet architectures
with stochastic gradient descent. Both the theoretical and experimental results
support the fact that staircase properties have a role to play in understanding
the capabilities of gradient-based learning on regular networks, in contrast to
general polynomial-size networks that can emulate any SQ or PAC algorithms as
recently shown.
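For intuition, here is a minimal sketch of the kind of function the abstract describes. It uses the standard illustrative staircase example S_d(x) = x_1 + x_1 x_2 + ... + x_1 x_2 ... x_d on the Boolean hypercube and checks by brute force that its nonzero Fourier coefficients form a single increasing chain; the variable names and the enumeration code are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch, assuming the canonical staircase example
#   S_d(x) = x1 + x1*x2 + ... + x1*x2*...*xd   over {-1,+1}^d.
# The brute-force Fourier computation below is illustrative, not the paper's code.
import itertools
import numpy as np

d = 4

def staircase(x):
    """Sum of monomials of increasing degree: x1, x1*x2, ..., x1*...*xd."""
    return sum(np.prod(x[:k + 1]) for k in range(d))

# Brute-force Fourier coefficients: f_hat(T) = E_x[ f(x) * prod_{i in T} x_i ].
cube = [np.array(x) for x in itertools.product([-1, 1], repeat=d)]
for r in range(1, d + 1):
    for T in itertools.combinations(range(d), r):
        coeff = np.mean([staircase(x) * np.prod(x[list(T)]) for x in cube])
        if abs(coeff) > 1e-9:
            print(T, round(coeff, 3))
# Output: only (0,), (0, 1), (0, 1, 2), (0, 1, 2, 3) survive -- each
# higher-order coefficient extends the previous support by one coordinate,
# i.e. the "increasing chain" structure the abstract refers to.
```

The chain structure is what makes layerwise, gradient-based learning plausible here: a learner can pick up the degree-1 term first and then extend it one coordinate at a time, rather than having to find a high-degree parity from scratch, which matches the abstract's description of greedily combining lower-level features along the depth of the network.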
Related papers
- Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions [20.036783417617652]
We investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms.
We show that a simple modification of the idealized single-pass gradient descent training scenario drastically improves its computational efficiency.
Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing.
arXiv Detail & Related papers (2024-05-24T11:34:31Z) - Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z) - Feature Network Methods in Machine Learning and Applications [0.0]
A machine learning (ML) feature network is a graph that connects ML features in learning tasks based on their similarity.
We provide an example of a deep tree-structured feature network, where hierarchical connections are formed through feature clustering and feed-forward learning.
arXiv Detail & Related papers (2024-01-10T01:57:12Z) - Data Topology-Dependent Upper Bounds of Neural Network Widths [52.58441144171022]
We first show that a three-layer neural network can be designed to approximate an indicator function over a compact set.
This is then extended to a simplicial complex, deriving width upper bounds based on its topological structure.
We prove the universal approximation property of three-layer ReLU networks using our topological approach.
arXiv Detail & Related papers (2023-05-25T14:17:15Z) - ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models [9.96121040675476]
This manuscript explores how properties of functions learned by neural networks of depth greater than two layers affect predictions.
Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs.
arXiv Detail & Related papers (2023-05-24T22:10:12Z) - Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural
Networks [49.808194368781095]
We show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks.
This work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.
arXiv Detail & Related papers (2023-05-11T17:19:30Z) - Rank Diminishing in Deep Neural Networks [71.03777954670323]
The rank of a neural network measures information flowing across layers.
It is an instance of a key structural condition that applies across broad domains of machine learning.
For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear.
arXiv Detail & Related papers (2022-06-13T12:03:32Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Towards Lower Bounds on the Depth of ReLU Neural Networks [7.355977594790584]
We investigate whether the class of exactly representable functions strictly increases by adding more layers.
We settle an old conjecture about piecewise linear functions by Wang and Sun (2005) in the affirmative.
We present upper bounds on the sizes of neural networks required to represent functions with logarithmic depth.
arXiv Detail & Related papers (2021-05-31T09:49:14Z) - Neural networks adapting to datasets: learning network size and topology [77.34726150561087]
We introduce a flexible setup that allows a neural network to learn both its size and topology during the course of gradient-based training.
The resulting network has the structure of a graph tailored to the particular learning task and dataset.
arXiv Detail & Related papers (2020-06-22T12:46:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.