Network size and weights size for memorization with two-layers neural
networks
- URL: http://arxiv.org/abs/2006.02855v2
- Date: Tue, 3 Nov 2020 07:15:50 GMT
- Title: Network size and weights size for memorization with two-layers neural
networks
- Authors: S\'ebastien Bubeck and Ronen Eldan and Yin Tat Lee and Dan Mikulincer
- Abstract summary: We propose a new training procedure for ReLU networks, based on complex (as opposed to real) recombination of the neurons.
We show approximate memorization with both $O\left(\frac{n}{d} \cdot \frac{\log(1/\epsilon)}{\epsilon}\right)$ neurons, as well as nearly-optimal size of the weights.
- Score: 15.333300054767726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In 1988, Eric B. Baum showed that two-layers neural networks with threshold
activation function can perfectly memorize the binary labels of $n$ points in
general position in $\mathbb{R}^d$ using only $\lceil n/d \rceil$
neurons. We observe that with ReLU networks, using four times as many neurons
one can fit arbitrary real labels. Moreover, for approximate memorization up to
error $\epsilon$, the neural tangent kernel can also memorize with only
$O\left(\frac{n}{d} \cdot \log(1/\epsilon) \right)$ neurons (assuming that the
data is well dispersed too). We show however that these constructions give rise
to networks where the magnitude of the neurons' weights is far from optimal.
In contrast we propose a new training procedure for ReLU networks, based on
complex (as opposed to real) recombination of the neurons, for which we show
approximate memorization with both $O\left(\frac{n}{d} \cdot
\frac{\log(1/\epsilon)}{\epsilon}\right)$ neurons, as well as nearly-optimal
size of the weights.
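As a point of reference for the memorization claims above, the following is a minimal sketch of the simplest regime mentioned here and in the related work: the first layer of a two-layer ReLU network is frozen at random values and only the output layer is fit by least squares, which memorizes arbitrary real labels once the width is on the order of $n$. This is not the paper's complex-recombination procedure (which also adapts the first layer and achieves the $n/d$-type scalings); the data, widths, and scalings below are arbitrary choices for illustration.

```python
# A minimal sketch (not the paper's construction): approximate memorization of n
# arbitrary real labels with a two-layer ReLU network in the random-features
# regime -- the first layer is frozen at random values and only the output layer
# is fit by least squares. All sizes and data below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 20                                 # number of points and input dimension
m = 2 * n                                      # hidden width; this naive regime needs m ~ n

X = rng.standard_normal((n, d))                # well-dispersed inputs
y = rng.standard_normal(n)                     # arbitrary real labels to memorize

W = rng.standard_normal((m, d)) / np.sqrt(d)   # frozen random first-layer weights
b = rng.uniform(-1.0, 1.0, size=m)             # frozen random biases

H = np.maximum(X @ W.T + b, 0.0)               # ReLU feature matrix, shape (n, m)

# Solve min_a ||H a - y||_2 for the output layer (min-norm solution since m > n).
a, *_ = np.linalg.lstsq(H, y, rcond=None)

print(f"max memorization error: {np.max(np.abs(H @ a - y)):.2e}")
print(f"output-layer weight norm ||a||_2: {np.linalg.norm(a):.2f}")
```

The paper's point is precisely that neuron counts alone do not tell the whole story: its complex-recombination procedure achieves both a number of neurons scaling with $n/d$ and nearly-optimal weight size, neither of which this sketch attempts.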
Related papers
- Memorization Capacity for Additive Fine-Tuning with Small ReLU Networks [16.320374162259117]
Fine-Tuning Capacity (FTC) is defined as the maximum number of samples a neural network can fine-tune on.
We show that $N$ samples can be fine-tuned with $m=\Theta(N)$ neurons for 2-layer networks, and with $m=\Theta(\sqrt{N})$ neurons for 3-layer networks, no matter how large $K$ is.
arXiv Detail & Related papers (2024-08-01T07:58:51Z) - Rates of Approximation by ReLU Shallow Neural Networks [8.22379888383833]
We show that ReLU shallow neural networks with $m$ hidden neurons can uniformly approximate functions from the Hölder space.
Such rates are very close to the optimal one $O(m^{-\frac{r}{d}})$ in the sense that $\frac{d+2}{d+4}$ is close to $1$ when the dimension $d$ is large.
arXiv Detail & Related papers (2023-07-24T00:16:50Z) - Generalization Ability of Wide Neural Networks on $\mathbb{R}$ [8.508360765158326]
We study the generalization ability of the wide two-layer ReLU neural network on $\mathbb{R}$.
We show that: $i)$ when the width $m \rightarrow \infty$, the neural network kernel (NNK) uniformly converges to the NTK; $ii)$ the minimax rate of regression over the RKHS associated to $K_1$ is $n^{-2/3}$; $iii)$ if one adopts the early stopping strategy in training a wide neural network, the resulting neural network achieves the minimax rate; $iv)$ ...
arXiv Detail & Related papers (2023-02-12T15:07:27Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - On the Optimal Memorization Power of ReLU Neural Networks [53.15475693468925]
We show that feedforward ReLU neural networks can memorize any $N$ points that satisfy a mild separability assumption.
We prove that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters.
arXiv Detail & Related papers (2021-10-07T05:25:23Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability (a toy sketch of this idea appears after this list).
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - An Exponential Improvement on the Memorization Capacity of Deep
Threshold Networks [40.489350374378645]
We prove that $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(\frac{d}{\delta}+n)$ weights are sufficient.
We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
arXiv Detail & Related papers (2021-06-14T19:42:32Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn, with gradient descent, a ground truth network with small loss, beyond what the Neural Tangent Kernel approach achieves with the same number of samples.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Towards Understanding Hierarchical Learning: Benefits of Neural
Representations [160.33479656108926]
In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks.
We show that neural representations can achieve improved sample complexity compared with the raw input.
Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
arXiv Detail & Related papers (2020-06-24T02:44:54Z) - Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z) - A Corrective View of Neural Networks: Representation, Memorization and
Learning [26.87238691716307]
We develop a corrective mechanism for neural network approximation.
We show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels.
We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions.
arXiv Detail & Related papers (2020-02-01T20:51:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.