A Corrective View of Neural Networks: Representation, Memorization and
Learning
- URL: http://arxiv.org/abs/2002.00274v2
- Date: Sat, 20 Jun 2020 02:37:48 GMT
- Title: A Corrective View of Neural Networks: Representation, Memorization and
Learning
- Authors: Guy Bresler and Dheeraj Nagaraj
- Abstract summary: We develop a corrective mechanism for neural network approximation.
We show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels.
We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions.
- Score: 26.87238691716307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a corrective mechanism for neural network approximation: the total
available non-linear units are divided into multiple groups and the first group
approximates the function under consideration, the second group approximates
the error in approximation produced by the first group and corrects it, the
third group approximates the error produced by the first and second groups
together and so on. This technique yields several new representation and
learning results for neural networks. First, we show that two-layer neural
networks in the random features regime (RF) can memorize arbitrary labels for
$n$ arbitrary points under a Euclidean distance separation condition using
$\tilde{O}(n)$ ReLUs, which is optimal in $n$ up to logarithmic factors. Next,
we give a powerful representation result for two-layer neural networks with
ReLUs and smoothed ReLUs which can achieve a squared error of at most
$\epsilon$ with $O(C(a,d)\epsilon^{-1/(a+1)})$ neurons for $a \in \mathbb{N}\cup\{0\}$
when the function is smooth enough (roughly when it has $\Theta(ad)$ bounded
derivatives). In certain cases $d$ can be replaced with effective dimension $q
\ll d$. Previous results of this type implement Taylor series approximation
using deep architectures. We also consider three-layer neural networks and show
that the corrective mechanism yields faster representation rates for smooth
radial functions. Lastly, we obtain the first $O(\mathrm{subpoly}(1/\epsilon))$
upper bound on the number of neurons required for a two-layer network to learn
low-degree polynomials up to squared error $\epsilon$ via gradient descent.
Even though deep networks can express these polynomials with
$O(\mathrm{polylog}(1/\epsilon))$ neurons, the best learning bounds on this
problem require $\mathrm{poly}(1/\epsilon)$ neurons.
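To make the grouping concrete, below is a minimal numerical sketch of the corrective idea in the random features regime; it is not the authors' construction or code. Each group of random ReLU units (inner weights sampled and then frozen) is fit by least squares to the residual left by the previous groups. The data, target function, group width, and the helper `random_relu_features` are illustrative assumptions.

```python
# Minimal sketch of the corrective mechanism described in the abstract
# (illustrative only; sizes, data, and target are assumptions).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                                 # assumed sample count and input dimension
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2      # assumed smooth target values


def random_relu_features(X, width, rng):
    """One group of random ReLU units: inner weights are random and fixed."""
    W = rng.normal(size=(X.shape[1], width))
    b = rng.normal(size=width)
    return np.maximum(X @ W + b, 0.0)


num_groups, group_width = 4, 50
prediction = np.zeros(n)
residual = y.copy()

for g in range(num_groups):
    Phi = random_relu_features(X, group_width, rng)
    # Fit only this group's output weights to the current residual (RF regime).
    coef, *_ = np.linalg.lstsq(Phi, residual, rcond=None)
    prediction += Phi @ coef
    residual = y - prediction                 # error left for the next group to correct
    print(f"group {g + 1}: mean squared error = {np.mean(residual ** 2):.3e}")
```

In the paper's analysis, each successive group shrinks the remaining approximation error, which is what drives the improved neuron counts quoted in the abstract; the loop above only illustrates that residual-fitting structure on toy data.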
Related papers
- Deep Neural Networks: Multi-Classification and Universal Approximation [0.0]
We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements belonging to $M$ classes.
We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m\geq 1$.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Approximation Rates for Shallow ReLU$^k$ Neural Networks on Sobolev Spaces via the Radon Transform [4.096453902709292]
We consider the problem of how efficiently shallow neural networks with the ReLU$^k$ activation function can approximate functions from Sobolev spaces.
We provide a simple proof of nearly optimal approximation rates in a variety of cases, including when $q\leq p$, $p\geq 2$, and $s \leq k + (d+1)/2$.
arXiv Detail & Related papers (2024-08-20T16:43:45Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d$
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations, we prove that the training loss is minimized and bound the resulting test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Shallow neural network representation of polynomials [91.3755431537592]
We show that $d$-variate polynomials of degree $R$ can be represented on $[0,1]^d$ as shallow neural networks of width $d+1+\sum_{r=2}^{R}\binom{r+d-1}{d-1}\left[\cdots\right]$
arXiv Detail & Related papers (2022-08-17T08:14:52Z) - Neural Networks can Learn Representations with Gradient Descent [68.95262816363288]
In specific regimes, neural networks trained by gradient descent behave like kernel methods.
In practice, it is known that neural networks strongly outperform their associated kernels.
arXiv Detail & Related papers (2022-06-30T09:24:02Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn with gradient descent, achieving loss at most that of the ground truth, beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z)