Statistical-Query Lower Bounds via Functional Gradients
- URL: http://arxiv.org/abs/2006.15812v2
- Date: Thu, 22 Oct 2020 21:10:48 GMT
- Title: Statistical-Query Lower Bounds via Functional Gradients
- Authors: Surbhi Goel, Aravind Gollakota, Adam Klivans
- Abstract summary: We show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c} \epsilon$ queries for some constants $b, c > 0$.
Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems.
- Score: 19.5924910463796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We give the first statistical-query lower bounds for agnostically learning
any non-polynomial activation with respect to Gaussian marginals (e.g., ReLU,
sigmoid, sign). For the specific problem of ReLU regression (equivalently,
agnostically learning a ReLU), we show that any statistical-query algorithm
with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c} \epsilon$
queries for some constants $b, c > 0$, where $n$ is the dimension and $\epsilon$
is the accuracy parameter. Our results rule out general (as opposed to
correlational) SQ learning algorithms, which is unusual for real-valued
learning problems. Our techniques involve a gradient boosting procedure for
"amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and
Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer
neural networks. The crucial new ingredient is the use of a nonstandard convex
functional during the boosting procedure. This also yields a best-possible
reduction between two commonly studied models of learning: agnostic learning
and probabilistic concepts.
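The statistical-query (SQ) model mentioned in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the helper names (`sample_example`, `sq_oracle`, `correlational_query`), the noise-free ReLU labels, and the Monte Carlo estimate of the expectation are all assumptions made for illustration. It shows the distinction the abstract draws: a general query may be any function of the pair (x, y), while a correlational query is restricted to the form y * g(x).

```python
import random


def relu(t):
    return max(0.0, t)


def sample_example(w, rng):
    """Draw x ~ N(0, I_n) and a label y = ReLU(<w, x>).

    Labels are noise-free here purely for simplicity (an assumption);
    the agnostic setting allows arbitrary labels.
    """
    x = [rng.gauss(0.0, 1.0) for _ in w]
    y = relu(sum(wi * xi for wi, xi in zip(w, x)))
    return x, y


def sq_oracle(query, w, tolerance, rng, num_samples=20000):
    """Hypothetical SQ oracle: answer E[query(x, y)] up to +/- tolerance.

    The expectation is estimated by Monte Carlo and then perturbed by a
    value in [-tolerance, tolerance]. A worst-case oracle may choose the
    perturbation adversarially; random noise is used here for illustration.
    """
    total = 0.0
    for _ in range(num_samples):
        x, y = sample_example(w, rng)
        total += query(x, y)
    return total / num_samples + rng.uniform(-tolerance, tolerance)


def correlational_query(g):
    """Build a correlational query, i.e. one of the form y * g(x)."""
    return lambda x, y: y * g(x)


rng = random.Random(0)
w = [1.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical hidden direction (unit norm)

# General query: any function of (x, y); here an indicator of y > 0.
general_answer = sq_oracle(lambda x, y: 1.0 if y > 0 else 0.0, w, 0.01, rng)

# Correlational query: restricted to y * g(x); here g(x) = x[0].
corr_answer = sq_oracle(correlational_query(lambda x: x[0]), w, 0.01, rng)
```

With this choice of `w`, both expectations equal 1/2 in exact arithmetic, so both answers should land close to 0.5, up to the tolerance and sampling error. The lower bound in the abstract concerns how many such queries are needed at a given tolerance, not how an oracle is implemented.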
Related papers
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with an arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
- Agnostically Learning Multi-index Models with Queries [54.290489524576756]
We study the power of query access for the task of agnostic learning under the Gaussian distribution.
We show that query access gives significant runtime improvements over random examples for agnostically learning MIMs.
arXiv Detail & Related papers (2023-12-27T15:50:47Z)
- Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression [21.849997443967705]
The two-layer regression problem has been well studied in prior work.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian matrix of the loss function is positive definite and Lipschitz continuous under certain assumptions.
arXiv Detail & Related papers (2023-11-26T19:19:02Z)
- Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z)
- Generalization and Stability of Interpolating Neural Networks with Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove that the training loss is minimized with $m=\Omega(\log^4(n))$ neurons and $T \approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T \approx n$, we bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals [47.81107898315438]
We develop a method for finding hard families of examples for a wide class of problems by using LP duality.
We show that $L^1$-regression is essentially best possible, and therefore that the computational difficulty of learning a concept class is closely related to the degree required to approximate any function from the class in $L^1$-norm.
arXiv Detail & Related papers (2021-02-08T18:06:32Z)
- Learning to extrapolate using continued fractions: Predicting the critical temperature of superconductor materials [5.905364646955811]
In the field of Artificial Intelligence (AI) and Machine Learning (ML), the approximation of unknown target functions $y=f(\mathbf{x})$ is a common objective.
We refer to $S$ as the training set and aim to identify a low-complexity mathematical model that can effectively approximate this target function for new instances $\mathbf{x}$.
arXiv Detail & Related papers (2020-11-27T04:57:40Z)
- Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
- Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds an $\epsilon$-optimal $Q$-function with sample complexity of $\widetilde{O}(\frac{1}{\epsilon^{\max(d_1, d_2)+2}})$ when the optimal $Q$-function has low rank $r$ and the discount factor is below a certain threshold.
arXiv Detail & Related papers (2020-06-11T00:55:35Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)