Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins
- URL: http://arxiv.org/abs/2010.00539v2
- Date: Sun, 14 Feb 2021 02:20:24 GMT
- Title: Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins
- Authors: Spencer Frei and Yuan Cao and Quanquan Gu
- Abstract summary: We show that gradient descent finds halfspaces with classification error $tilde O(mathsfOPT1/2) + varepsilon$ in $mathrmpoly(d,1/varepsilon)$ time and sample complexity.
- Score: 92.7662890047311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the properties of gradient descent on convex surrogates for the
zero-one loss for the agnostic learning of linear halfspaces. If $\mathsf{OPT}$
is the best classification error achieved by a halfspace, by appealing to the
notion of soft margins we are able to show that gradient descent finds
halfspaces with classification error $\tilde O(\mathsf{OPT}^{1/2}) +
\varepsilon$ in $\mathrm{poly}(d,1/\varepsilon)$ time and sample complexity for
a broad class of distributions that includes log-concave isotropic
distributions as a subclass. Along the way we answer a question recently posed
by Ji et al. (2020) on how the tail behavior of a loss function can affect
sample complexity and runtime guarantees for gradient descent.
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $widetildeO(n-4/5)$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Near-Optimal Cryptographic Hardness of Agnostically Learning Halfspaces
and ReLU Regression under Gaussian Marginals [43.0867217287089]
We study the task of agnostically learning halfspaces under the Gaussian distribution.
We prove a near-optimal computational hardness result for this task.
arXiv Detail & Related papers (2023-02-13T16:46:23Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non- Hilbert optimization using first-order algorithms for which the gradient estimates may have tails.
We show that a combination of gradient, momentum, and normalized gradient descent convergence to critical points in high-probability with best-known iteration for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Provable Robustness of Adversarial Training for Learning Halfspaces with
Noise [95.84614821570283]
We analyze the properties of adversarial learning adversarially robust halfspaces in the presence of label noise.
To the best of our knowledge, this is the first work to show that adversarial training prov yields classifiers in noise.
arXiv Detail & Related papers (2021-04-19T16:35:38Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - Gradient Methods Never Overfit On Separable Data [31.714928102950584]
We show that standard gradient methods never overfit on separable data.
We present non-asymptotic bounds on the number of margin violations over the dataset.
arXiv Detail & Related papers (2020-06-30T18:01:46Z) - Learning Halfspaces with Tsybakov Noise [50.659479930171585]
We study the learnability of halfspaces in the presence of Tsybakov noise.
We give an algorithm that achieves misclassification error $epsilon$ with respect to the true halfspace.
arXiv Detail & Related papers (2020-06-11T14:25:02Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.