Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent
- URL: http://arxiv.org/abs/2206.08918v1
- Date: Fri, 17 Jun 2022 17:55:43 GMT
- Title: Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent
- Authors: Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis
- Abstract summary: We study learning a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma$.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\,\epsilon$ with high probability.
- Score: 50.659479930171585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the fundamental problem of learning a single neuron, i.e., a
function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for
monotone activations $\sigma:\mathbb{R}\mapsto\mathbb{R}$, with respect to the
$L_2^2$-loss in the presence of adversarial label noise. Specifically, we are
given labeled examples from a distribution $D$ on $(\mathbf{x},
y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists
$\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=\epsilon$, where
$F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(\sigma(\mathbf{w}\cdot
\mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector
$\mathbf{w}$ such that $F(\mathbf{w})=C\, \epsilon$ with high probability,
where $C>1$ is a universal constant. As our main contribution, we give
efficient constant-factor approximate learners for a broad class of
distributions (including log-concave distributions) and activation functions.
Concretely, for the class of isotropic log-concave distributions, we obtain the
following important corollaries:
For the logistic activation, we obtain the first polynomial-time constant
factor approximation (even under the Gaussian distribution). Our algorithm has
sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within
polylogarithmic factors.
For the ReLU activation, we give an efficient algorithm with sample
complexity $\widetilde{O}(d\, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best
known constant-factor approximate learner had sample complexity
$\widetilde{\Omega}(d/\epsilon)$.
In both of these settings, our algorithms are simple, performing
gradient descent on the (regularized) $L_2^2$-loss. The correctness of our
algorithms relies on novel structural results that we establish, showing that
(essentially all) stationary points of the underlying non-convex loss are
approximately optimal.
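To make the algorithmic template concrete, the sketch below runs plain gradient descent on the (regularized) empirical $L_2^2$-loss for a single ReLU neuron. It is a minimal illustration of the loss and update rule described in the abstract, not the paper's exact procedure: the step size, ridge parameter `lam`, iteration count, the moment-based warm start, and the label corruption used in the demo are all illustrative assumptions.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def relu_grad(z):
    # Subgradient of ReLU; the value at 0 is immaterial for continuous data.
    return (z > 0).astype(float)


def gd_single_neuron(X, y, step=0.1, lam=0.0, iters=500):
    """Gradient descent on the (regularized) empirical L2^2-loss
    F_hat(w) = mean((relu(X w) - y)^2) + lam * ||w||^2."""
    n, d = X.shape
    # Moment-based warm start: under symmetric marginals E[y x] is aligned
    # with w*, and it avoids the dead gradient at w = 0 for ReLU.
    w = X.T @ y / n
    for _ in range(iters):
        z = X @ w
        residual = relu(z) - y
        grad = 2.0 * (X.T @ (residual * relu_grad(z))) / n + 2.0 * lam * w
        w -= step * grad
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 5000, 10
    X = rng.standard_normal((n, d))            # isotropic Gaussian marginals
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    y = relu(X @ w_star)
    # Corrupt a small fraction of labels as a stand-in for adversarial noise.
    bad = rng.choice(n, size=n // 50, replace=False)
    y[bad] = -y[bad] - 1.0
    w_hat = gd_single_neuron(X, y, step=0.1, lam=1e-3, iters=500)
    print("parameter error:", np.linalg.norm(w_hat - w_star))
```

The reason such a simple procedure can succeed, per the structural results above, is that (essentially all) stationary points of this non-convex objective are approximately optimal under well-behaved (e.g., isotropic log-concave) marginals.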
Related papers
- Near-Optimal and Tractable Estimation under Shift-Invariance [0.21756081703275998]
The class of all such signals is extremely rich: it contains all exponential oscillations over $\mathbb{C}^n$ with total degree $s$.
We show that the statistical complexity of this class, as measured by the squared minimax radius of the $\delta$-confidence $\ell_2$-ball, is nearly the same as for the class of $s$-sparse signals, namely $O\left(s\log(en) + \log(\delta^{-1})\right) \cdot \log(en/s)$.
arXiv Detail & Related papers (2024-11-05T18:11:23Z) - Iterative thresholding for non-linear learning in the strong $\varepsilon$-contamination model [3.309767076331365]
We derive approximation bounds for learning single neuron models using thresholded descent.
We also study the linear regression problem, where $\sigma(\mathbf{x}) = \mathbf{x}$.
arXiv Detail & Related papers (2024-09-05T16:59:56Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Optimal Estimator for Linear Regression with Shuffled Labels [17.99906229036223]
This paper considers the task of linear regression with shuffled labels.
Here $\mathbf{Y} \in \mathbb{R}^{n\times m}$, $\mathbf{\Pi} \in \mathbb{R}^{n\times p}$, $\mathbf{B} \in \mathbb{R}^{p\times m}$, and $\mathbf{W} \in \mathbb{R}^{n\times m}$ denote the observations, the unknown correspondence, the signal, and the noise, respectively.
arXiv Detail & Related papers (2023-10-02T16:44:47Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^\ast$ rather than for all $\mathbf{x}^\ast$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F : \mathbb{R}^{d\times m} \times \mathbb{R}^{d} \to \mathbb{R}$ based on an exponential activation function.
arXiv Detail & Related papers (2023-03-29T07:29:07Z) - Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals.
Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
arXiv Detail & Related papers (2021-08-19T16:16:48Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)