Learning the mapping $\mathbf{x}\mapsto \sum_{i=1}^d x_i^2$: the cost of
finding the needle in a haystack
- URL: http://arxiv.org/abs/2002.10561v2
- Date: Sat, 16 May 2020 21:14:02 GMT
- Title: Learning the mapping $\mathbf{x}\mapsto \sum_{i=1}^d x_i^2$: the cost of
finding the needle in a haystack
- Authors: Jiefu Zhang, Leonardo Zepeda-Núñez, Yuan Yao, Lin Lin
- Abstract summary: We show that the cost of finding the needle in a haystack is related to the Barron norm of the function.
In order to control the size of the generalization gap, we find that the use of explicit regularization becomes increasingly important as $d$ increases.
- Score: 6.436049723896002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of using machine learning to approximate the mapping
$\mathbf{x}\mapsto\sum_{i=1}^d x_i^2$ with $x_i\in[-1,1]$ seems to be a trivial
one. Given the knowledge of the separable structure of the function, one can
design a sparse network to represent the function very accurately, or even
exactly. When such structural information is not available, and we may only use
a dense neural network, the optimization procedure to find the sparse network
embedded in the dense network is similar to finding the needle in a haystack,
using a given number of samples of the function. We demonstrate that the cost
(measured by sample complexity) of finding the needle is directly related to
the Barron norm of the function. While only a small number of samples is needed
to train a sparse network, the dense network trained with the same number of
samples exhibits large test loss and a large generalization gap. In order to
control the size of the generalization gap, we find that the use of explicit
regularization becomes increasingly important as $d$ increases. The
numerically observed sample complexity with explicit regularization scales as
$\mathcal{O}(d^{2.5})$, which is in fact better than the theoretically
predicted sample complexity that scales as $\mathcal{O}(d^{4})$. Without
explicit regularization (also called implicit regularization), the numerically
observed sample complexity is significantly higher and is close to
$\mathcal{O}(d^{4.5})$.
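To make the setup concrete, the sketch below (a minimal illustration, not the authors' code) samples $f(\mathbf{x})=\sum_{i=1}^d x_i^2$ on $[-1,1]^d$ and fits a dense two-layer ReLU network with an explicit path-norm-style penalty as a stand-in for the explicit regularization discussed above; the width, sample count, penalty weight, and optimizer settings are illustrative assumptions. A network allowed to exploit the separable structure, e.g. one squaring unit per coordinate, would instead represent $f$ exactly with only $d$ hidden units.
```python
# Minimal sketch (illustrative assumptions throughout, not the authors' code):
# fit f(x) = sum_i x_i^2 on [-1, 1]^d with a dense two-layer ReLU network,
# adding an explicit path-norm-style penalty as a stand-in for the paper's
# explicit regularization.
import torch

d, m, n_train = 10, 512, 2000            # dimension, width, sample size (assumed)
torch.manual_seed(0)

# Training data: x ~ Uniform([-1, 1]^d), y = sum_i x_i^2.
X = 2 * torch.rand(n_train, d) - 1
y = (X ** 2).sum(dim=1, keepdim=True)

# Dense two-layer ReLU network: f_theta(x) = a^T relu(Wx + b) + c.
model = torch.nn.Sequential(
    torch.nn.Linear(d, m),
    torch.nn.ReLU(),
    torch.nn.Linear(m, 1),
)

def path_norm(net):
    """Path-norm-like penalty sum_k |a_k| (||w_k||_1 + |b_k|); illustrative choice."""
    W, b = net[0].weight, net[0].bias     # hidden-layer weights and biases
    a = net[2].weight.squeeze(0)          # output-layer weights
    return (a.abs() * (W.abs().sum(dim=1) + b.abs())).sum()

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-4                                # regularization strength (assumed)
for step in range(5000):
    opt.zero_grad()
    loss = torch.mean((model(X) - y) ** 2) + lam * path_norm(model)
    loss.backward()
    opt.step()

# Estimate the generalization gap: training loss vs. loss on fresh samples.
X_test = 2 * torch.rand(10_000, d) - 1
y_test = (X_test ** 2).sum(dim=1, keepdim=True)
with torch.no_grad():
    print("train MSE:", torch.mean((model(X) - y) ** 2).item())
    print("test  MSE:", torch.mean((model(X_test) - y_test) ** 2).item())
```
Sweeping $d$ and the number of training samples in a script of this kind, with and without the explicit penalty, is the type of experiment behind the empirically observed $\mathcal{O}(d^{2.5})$ versus $\mathcal{O}(d^{4.5})$ scalings quoted above.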
Related papers
- Sample and Computationally Efficient Robust Learning of Gaussian Single-Index Models [37.42736399673992]
A single-index model (SIM) is a function of the form $\sigma(\mathbf{w}^\ast \cdot \mathbf{x})$, where $\sigma: \mathbb{R} \to \mathbb{R}$ is a known link function and $\mathbf{w}^\ast$ is a hidden unit vector.
We show that a proper learner attains $L_2$-error of $O(\mathrm{OPT})+\epsilon$, where ...
arXiv Detail & Related papers (2024-11-08T17:10:38Z) - Optimal Sample Complexity of Contrastive Learning [12.108926584457736]
We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy.
We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions.
We show that the theoretical bounds on sample complexity obtained via VC/Natarajan can have strong predictive power for experimental results.
arXiv Detail & Related papers (2023-12-01T06:57:11Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Near-Optimal Bounds for Learning Gaussian Halfspaces with Random
Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$.
Our lower bound suggests that this quadratic dependence on $1/epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z) - Is Stochastic Gradient Descent Near Optimal? [0.0]
We show that stochastic gradient descent achieves small expected error with a bounded number of samples and total number of queries.
This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (arXiv:2203.00246) in a computationally efficient manner.
arXiv Detail & Related papers (2022-09-18T18:26:43Z) - Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol{\mu} \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol{\mu})$.
arXiv Detail & Related papers (2022-03-18T18:50:52Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can make two classes of data linearly separable with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Statistically Meaningful Approximation: a Case Study on Approximating
Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z) - Why Are Convolutional Nets More Sample-Efficient than Fully-Connected
Nets? [33.51250867983687]
We exhibit a natural task on which a provable sample complexity gap can be shown for standard training algorithms.
We demonstrate a single target function, learning which on all possible distributions leads to an $O(1)$ vs. $\Omega(d^2/\varepsilon)$ gap.
Similar results are achieved for $\ell_2$ regression and adaptive training algorithms, e.g. Adam and AdaGrad.
arXiv Detail & Related papers (2020-10-16T17:15:39Z) - On the Modularity of Hypernetworks [103.1147622394852]
We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
arXiv Detail & Related papers (2020-02-23T22:51:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.