Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
- URL: http://arxiv.org/abs/2010.08515v2
- Date: Tue, 4 May 2021 17:54:15 GMT
- Title: Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
- Authors: Zhiyuan Li, Yi Zhang, Sanjeev Arora
- Abstract summary: We show a natural task on which a provable sample complexity gap can be shown, for standard training algorithms.
We demonstrate a single target function, learning which on all possible distributions leads to an $O(1)$ vs $\Omega(d^2/\varepsilon)$ gap.
Similar results are achieved for $\ell_2$ regression and adaptive training algorithms, e.g. Adam and AdaGrad.
- Score: 33.51250867983687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks often dominate fully-connected counterparts in
generalization performance, especially on image classification tasks. This is
often explained in terms of 'better inductive bias'. However, this has not been
made mathematically rigorous, and the hurdle is that the fully connected net
can always simulate the convolutional net (for a fixed task). Thus the training
algorithm plays a role. The current work describes a natural task on which a
provable sample complexity gap can be shown, for standard training algorithms.
We construct a single natural distribution on $\mathbb{R}^d\times\{\pm 1\}$ on
which any orthogonal-invariant algorithm (i.e. fully-connected networks trained
with most gradient-based methods from Gaussian initialization) requires
$\Omega(d^2)$ samples to generalize while $O(1)$ samples suffice for
convolutional architectures. Furthermore, we demonstrate a single target
function, learning which on all possible distributions leads to an $O(1)$ vs
$\Omega(d^2/\varepsilon)$ gap. The proof relies on the fact that SGD on
fully-connected networks is orthogonal equivariant. Similar results are achieved
for $\ell_2$ regression and adaptive training algorithms, e.g. Adam and
AdaGrad, which are only permutation equivariant.
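  Since the key technical fact is that gradient descent on a fully-connected net started from Gaussian initialization is orthogonal equivariant, a small numerical check can make the statement concrete. The sketch below is a hypothetical illustration, not the paper's construction: it uses NumPy, a two-layer ReLU network, the square loss on synthetic placeholder data, and plain full-batch gradient descent. It trains one net on the data and another on data rotated by an orthogonal matrix $Q$, with the first-layer initialization rotated correspondingly; the two trained predictors then agree on $x$ and $Qx$ up to floating-point error.

```python
# Minimal numerical sketch (illustrative, not the paper's construction):
# gradient descent on a two-layer fully-connected ReLU net is orthogonally
# equivariant. Rotating the data by Q and the first-layer initialization by
# Q^T yields an identical training trajectory, so the two trained nets agree
# on correspondingly rotated inputs. Because Gaussian initialization is
# rotation-invariant in distribution, the full algorithm is orthogonal-
# equivariant in distribution.
import numpy as np

rng = np.random.default_rng(0)
d, m, n, lr, steps = 8, 32, 64, 0.05, 200

# Synthetic placeholder data (not the distribution constructed in the paper).
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))

# Random orthogonal matrix Q via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def train(X, W1, w2):
    """Full-batch gradient descent on the square loss of a 2-layer ReLU net."""
    for _ in range(steps):
        H = np.maximum(X @ W1.T, 0.0)        # hidden activations, shape (n, m)
        pred = H @ w2                        # network outputs, shape (n,)
        g = 2.0 * (pred - y) / n             # d(loss)/d(pred)
        grad_w2 = H.T @ g
        grad_W1 = ((g[:, None] * (H > 0)) * w2).T @ X
        W1 -= lr * grad_W1
        w2 -= lr * grad_w2
    return W1, w2

# Gaussian initialization.
W1_0 = rng.standard_normal((m, d)) / np.sqrt(d)
w2_0 = rng.standard_normal(m) / np.sqrt(m)

# Train on the original data, and on rotated data with a rotated first layer.
W1_a, w2_a = train(X, W1_0.copy(), w2_0.copy())
W1_b, w2_b = train(X @ Q.T, W1_0 @ Q.T, w2_0.copy())

# Equivariance check on a fresh point: f_a(x) == f_b(Qx) up to float error.
x = rng.standard_normal(d)
f_a = np.maximum(W1_a @ x, 0.0) @ w2_a
f_b = np.maximum(W1_b @ (Q @ x), 0.0) @ w2_b
print(abs(f_a - f_b))   # numerically ~0
```

  If the second run instead drew a fresh Gaussian initialization, the agreement would hold in distribution rather than exactly, since a Gaussian first layer is rotation-invariant. Convolutional architectures break this symmetry through their local, weight-shared structure, which is the asymmetry the paper's separation exploits.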
Related papers
- Replicable Uniformity Testing [1.5883812630616523]
This work revisits uniformity testing under the framework of algorithmic replicability.
We obtain a replicable tester using only $\tilde{O}(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$ samples.
arXiv Detail & Related papers (2024-10-12T02:55:17Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Stochastic Approximation Approaches to Group Distributionally Robust
Optimization [96.26317627118912]
Group distributionally robust optimization (GDRO) is studied via online learning techniques that reduce the number of samples required in each round from $m$ to $1$, keeping the same sample complexity.
A novel formulation of weighted GDRO, which allows us to derive distribution-dependent convergence rates.
arXiv Detail & Related papers (2023-02-18T09:24:15Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove convergence of the training loss with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/n)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - A Feedforward Unitary Equivariant Neural Network [3.6220250022337335]
We devise a new type of feedforward neural network.
It is equivariant with respect to the unitary group $U(n)$.
The input and output can be vectors in $\mathbb{C}^n$ with arbitrary dimension $n$.
arXiv Detail & Related papers (2022-08-25T15:05:02Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - Distributed Sparse Feature Selection in Communication-Restricted
Networks [6.9257380648471765]
We propose and theoretically analyze a new distributed scheme for sparse linear regression and feature selection.
In order to infer the causal dimensions from the whole dataset, we propose a simple, yet effective method for information sharing in the network.
arXiv Detail & Related papers (2021-11-02T05:02:24Z) - Convergence and Sample Complexity of SGD in GANs [15.25030172685628]
We provide convergence guarantees on training Generative Adversarial Networks (GANs) via SGD.
We consider learning a target distribution modeled by a 1-layer Generator network with a non-linear activation function.
Our results apply to a broad class of non-linear activation functions $\phi$, including ReLUs, and are enabled by a connection with truncated statistics.
arXiv Detail & Related papers (2020-12-01T18:50:38Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.