Role of Locality and Weight Sharing in Image-Based Tasks: A Sample Complexity Separation between CNNs, LCNs, and FCNs
- URL: http://arxiv.org/abs/2403.15707v1
- Date: Sat, 23 Mar 2024 03:57:28 GMT
- Title: Role of Locality and Weight Sharing in Image-Based Tasks: A Sample Complexity Separation between CNNs, LCNs, and FCNs
- Authors: Aakash Lahoti, Stefani Karp, Ezra Winston, Aarti Singh, Yuanzhi Li,
- Abstract summary: Vision tasks are characterized by the properties of locality and translation invariance.
The superior performance of convolutional neural networks (CNNs) on these tasks is widely attributed to the inductive bias of locality and weight sharing baked into their architecture.
Existing attempts to quantify the statistical benefits of these biases in CNNs over locally connected neural networks (LCNs) and fully connected neural networks (FCNs) fall into one of the following categories.
- Score: 42.551773746803946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision tasks are characterized by the properties of locality and translation invariance. The superior performance of convolutional neural networks (CNNs) on these tasks is widely attributed to the inductive bias of locality and weight sharing baked into their architecture. Existing attempts to quantify the statistical benefits of these biases in CNNs over locally connected convolutional neural networks (LCNs) and fully connected neural networks (FCNs) fall into one of the following categories: either they disregard the optimizer and only provide uniform convergence upper bounds with no separating lower bounds, or they consider simplistic tasks that do not truly mirror the locality and translation invariance as found in real-world vision tasks. To address these deficiencies, we introduce the Dynamic Signal Distribution (DSD) classification task that models an image as consisting of $k$ patches, each of dimension $d$, and the label is determined by a $d$-sparse signal vector that can freely appear in any one of the $k$ patches. On this task, for any orthogonally equivariant algorithm like gradient descent, we prove that CNNs require $\tilde{O}(k+d)$ samples, whereas LCNs require $\Omega(kd)$ samples, establishing the statistical advantages of weight sharing in translation invariant tasks. Furthermore, LCNs need $\tilde{O}(k(k+d))$ samples, compared to $\Omega(k^2d)$ samples for FCNs, showcasing the benefits of locality in local tasks. Additionally, we develop information theoretic tools for analyzing randomized algorithms, which may be of interest for statistical research.
Related papers
- Fixed Points of Deep Neural Networks: Emergence, Stability, and Applications [0.0]
We present results on the formation and stability of a family of fixed points of deep neural networks (DNNs)
We demonstrate examples of applications of such networks in supervised, semi-supervised and unsupervised learning.
arXiv Detail & Related papers (2025-01-07T23:23:26Z) - SGD Finds then Tunes Features in Two-Layer Neural Networks with
near-Optimal Sample Complexity: A Case Study in the XOR problem [1.3597551064547502]
We consider the optimization process of minibatch descent gradient (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function.
We prove that with data drawn from the $d$-dimensional Boolean hypercube labeled by the quadratic XOR'' function $y = -x_ix_j$, it is possible to train to a population error $o(1)$ with $d :textpolylog(d)$ samples.
arXiv Detail & Related papers (2023-09-26T17:57:44Z) - Theoretical Analysis of Inductive Biases in Deep Convolutional Networks [16.41952363194339]
We provide a theoretical analysis of the inductive biases in convolutional neural networks (CNNs)
We compare the performance of CNNs, locally-connected networks (LCNs), and fully-connected networks (FCNs) on a simple regression task.
We prove that LCNs require $Omega(d)$ samples while CNNs need only $widetildemathcalO(log2d)$ samples, highlighting the critical role of weight sharing.
arXiv Detail & Related papers (2023-05-15T07:40:07Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P* sim sqrtN$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Decentralized Sparse Linear Regression via Gradient-Tracking: Linear Convergence and Statistical Guarantees [23.256961881716595]
We study a sparse linear regression over a network of agents, modeled as an undirected graph and no server node.
We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm.
arXiv Detail & Related papers (2022-01-21T01:26:08Z) - Distributed Sparse Feature Selection in Communication-Restricted
Networks [6.9257380648471765]
We propose and theoretically analyze a new distributed scheme for sparse linear regression and feature selection.
In order to infer the causal dimensions from the whole dataset, we propose a simple, yet effective method for information sharing in the network.
arXiv Detail & Related papers (2021-11-02T05:02:24Z) - A Biased Graph Neural Network Sampler with Near-Optimal Regret [57.70126763759996]
Graph neural networks (GNN) have emerged as a vehicle for applying deep network architectures to graph and relational data.
In this paper, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem.
We introduce a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly-unbounded payouts.
arXiv Detail & Related papers (2021-03-01T15:55:58Z) - On the Global Convergence of Training Deep Linear ResNets [104.76256863926629]
We study the convergence of gradient descent (GD) and gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets)
We prove that for training deep residual networks with certain linear transformations at input and output layers, both GD and SGD can converge to the global minimum of the training loss.
arXiv Detail & Related papers (2020-03-02T18:34:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.