On the Sample Complexity of Two-Layer Networks: Lipschitz vs.
Element-Wise Lipschitz Activation
- URL: http://arxiv.org/abs/2211.09634v4
- Date: Sat, 20 Jan 2024 22:04:21 GMT
- Title: On the Sample Complexity of Two-Layer Networks: Lipschitz vs.
Element-Wise Lipschitz Activation
- Authors: Amit Daniely and Elad Granot
- Abstract summary: We investigate the sample complexity of bounded two-layer neural networks using different activation functions.
We prove that if $\sigma$ is element-wise, then the sample complexity of $\mathcal{H}$ has only logarithmic dependency in width.
- Score: 20.70453775428433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the sample complexity of bounded two-layer neural networks
using different activation functions.
In particular, we consider the class
$$ \mathcal{H} = \left\{\textbf{x}\mapsto \langle \textbf{v}, \sigma \circ
W\textbf{x} + \textbf{b} \rangle : \textbf{b}\in\mathbb{R}^{\mathcal{T}}, W \in
\mathbb{R}^{\mathcal{T}\times d}, \textbf{v} \in
\mathbb{R}^{\mathcal{T}}\right\} $$
where the spectral norm of $W$ and the Euclidean norm of $\textbf{v}$ are bounded by $O(1)$, the
Frobenius norm of $W$ is bounded from its initialization by $R > 0$, and
$\sigma$ is a Lipschitz activation function.
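As a concrete illustration (a minimal sketch, not the authors' code), the following Python snippet evaluates a member of $\mathcal{H}$ and checks the stated norm constraints, assuming $\sigma$ is the element-wise ReLU and taking the $O(1)$ bound to be the constant $1$:

    # Minimal sketch (assumptions: sigma = element-wise ReLU, the O(1)
    # constant taken to be 1); not the authors' implementation.
    import numpy as np

    def two_layer(x, W, b, v):
        """Evaluate x -> <v, sigma(W x) + b> with sigma = ReLU."""
        return np.dot(v, np.maximum(W @ x, 0.0) + b)

    def in_class(W, W0, v, R, c=1.0):
        """Check the norm constraints stated in the abstract."""
        return (np.linalg.norm(W, ord=2) <= c              # spectral norm of W
                and np.linalg.norm(v) <= c                 # Euclidean norm of v
                and np.linalg.norm(W - W0, 'fro') <= R)    # distance from initialization

    # Example with width T = 512 and input dimension d = 32 (arbitrary choices).
    rng = np.random.default_rng(0)
    d, T, R = 32, 512, 1.0
    W0 = rng.normal(size=(T, d))
    W0 *= 0.5 / np.linalg.norm(W0, ord=2)        # scale initialization to spectral norm 1/2
    Delta = rng.normal(size=(T, d))
    Delta *= (R / 2) / np.linalg.norm(Delta, 'fro')
    W = W0 + Delta                               # ||W - W0||_F = R/2 <= R
    b, v, x = np.zeros(T), rng.normal(size=T), rng.normal(size=d)
    v /= np.linalg.norm(v)                       # unit Euclidean norm
    print(two_layer(x, W, b, v), in_class(W, W0, v, R))

The constraint check passes because the perturbation of $W_0$ is scaled to Frobenius norm $R/2$, so the spectral norm of $W$ stays below $1$.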
We prove that if $\sigma$ is element-wise, then the sample complexity of
$\mathcal{H}$ has only logarithmic dependency in width and that this complexity
is tight, up to logarithmic factors.
We further show that the element-wise property of $\sigma$ is essential for a
logarithmic dependency bound in width, in the sense that there exist
non-element-wise activation functions whose sample complexity is linear in
width, for widths that can be up to exponential in the input dimension.
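To make the distinction concrete, consider the following pair of $1$-Lipschitz activations on $\mathbb{R}^{\mathcal{T}}$ (an illustrative choice, not the construction used in the lower bound): the first acts element-wise, while the second couples all coordinates through a maximum,
$$ \sigma_{\mathrm{elt}}(\textbf{z})_i = \max\{z_i, 0\}, \qquad \sigma_{\mathrm{max}}(\textbf{z}) = \Big(\max_{1\le i\le \mathcal{T}} z_i\Big)\,\textbf{e}_1, $$
and only activations of the first kind fall under the logarithmic-in-width upper bound.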
For the upper bound, we use the Approximate Description Length (ADL) approach to norm-based bounds recently introduced in arXiv:1910.05697.
We further develop new techniques and tools for this approach that we hope will inspire future work.
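For orientation (a paraphrase of the ADL guarantee rather than its precise statement), the framework bounds the uniform generalization gap of a class by the number of bits needed to approximately describe its members: if $\mathcal{H}$ admits an approximate description length of $n$ bits, then for bounded Lipschitz losses $\ell$ and a sample of size $m$, roughly
$$ \sup_{h\in\mathcal{H}} \left| \mathop{\mathbb{E}}_{\textbf{x}\sim\mathcal{D}} \ell(h(\textbf{x})) - \frac{1}{m}\sum_{i=1}^{m} \ell(h(\textbf{x}_i)) \right| \lesssim \sqrt{\frac{n}{m}}, $$
up to logarithmic factors; thus bounding the ADL of $\mathcal{H}$ by a quantity with only logarithmic dependence on the width yields the claimed sample complexity.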
Related papers
- The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
We show that this problem has randomized communication complexity $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$.
As an application, we obtain an $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$ space lower bound for any streaming algorithm with $k$ passes.
arXiv Detail & Related papers (2024-10-26T06:21:42Z) - Iterative thresholding for non-linear learning in the strong $\varepsilon$-contamination model [3.309767076331365]
We derive approximation bounds for learning single neuron models using thresholded descent.
We also study the linear regression problem, where $\sigma(\mathbf{x}) = \mathbf{x}$.
arXiv Detail & Related papers (2024-09-05T16:59:56Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with an arbitrary link function at a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d \cdots$
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Optimal Estimator for Linear Regression with Shuffled Labels [17.99906229036223]
This paper considers the task of linear regression with shuffled labels.
$\mathbf{Y} \in \mathbb{R}^{n\times m}$, $\mathbf{\Pi} \in \mathbb{R}^{n\times p}$, $\mathbf{B} \in \mathbb{R}^{p\times m}$, and $\mathbf{W} \in \mathbb{R}^{n\times m}$, respectively.
arXiv Detail & Related papers (2023-10-02T16:44:47Z) - An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^{d} \times \mathbb{R}^{d} \times \cdots$
arXiv Detail & Related papers (2023-03-29T07:29:07Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - On the Complexity of Minimizing Convex Finite Sums Without Using the
Indices of the Individual Functions [62.01594253618911]
We exploit the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model.
Following a similar approach, we propose a novel adaptation of SVRG which is both compatible with oracles, and achieves complexity bounds of $\tilde{O}(n^2+n\sqrt{L/\mu})\log(1/\epsilon)$ and $O(n\sqrt{L/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively.
arXiv Detail & Related papers (2020-02-09T03:39:46Z)