Ridgeless Interpolation with Shallow ReLU Networks in $1D$ is Nearest
Neighbor Curvature Extrapolation and Provably Generalizes on Lipschitz
Functions
- URL: http://arxiv.org/abs/2109.12960v1
- Date: Mon, 27 Sep 2021 11:32:24 GMT
- Title: Ridgeless Interpolation with Shallow ReLU Networks in $1D$ is Nearest
Neighbor Curvature Extrapolation and Provably Generalizes on Lipschitz
Functions
- Authors: Boris Hanin
- Abstract summary: We prove a precise geometric description of all one-layer ReLU networks $z(x;\theta)$ with a single linear unit.
We refer to them as ridgeless ReLU interpolants.
Our results show that ridgeless ReLU interpolants achieve the best possible generalization for learning $1d$ Lipschitz functions.
- Score: 16.75218291152252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We prove a precise geometric description of all one layer ReLU networks
$z(x;\theta)$ with a single linear unit and input/output dimensions equal to
one that interpolate a given dataset $\mathcal D=\{(x_i,f(x_i))\}$ and, among
all such interpolants, minimize the $\ell_2$-norm of the neuron weights. Such
networks can intuitively be thought of as those that minimize the mean-squared
error over $\mathcal D$ plus an infinitesimal weight decay penalty. We
therefore refer to them as ridgeless ReLU interpolants. Our description proves
that, to extrapolate values $z(x;\theta)$ for inputs $x\in (x_i,x_{i+1})$ lying
between two consecutive datapoints, a ridgeless ReLU interpolant simply
compares the signs of the discrete estimates for the curvature of $f$ at $x_i$
and $x_{i+1}$ derived from the dataset $\mathcal D$. If the curvature estimates
at $x_i$ and $x_{i+1}$ have different signs, then $z(x;\theta)$ must be linear
on $(x_i,x_{i+1})$. If, in contrast, the curvature estimates at $x_i$ and
$x_{i+1}$ are both positive (resp. negative), then $z(x;\theta)$ is convex
(resp. concave) on $(x_i,x_{i+1})$. Our results show that ridgeless ReLU
interpolants achieve the best possible generalization for learning $1d$
Lipschitz functions, up to universal constants.
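To make the sign rule concrete: at an interior datapoint $x_i$, the discrete curvature estimate is, up to a positive normalization, the second divided difference of the data, i.e. the change in secant slope $\frac{f(x_{i+1})-f(x_i)}{x_{i+1}-x_i}-\frac{f(x_i)-f(x_{i-1})}{x_i-x_{i-1}}$. The sketch below is a minimal illustration of this classification rule, assuming NumPy; the function name and the convention of labeling zero-curvature cases as linear are ours, not the paper's, and boundary intervals are omitted because a curvature estimate needs both neighbors.

```python
import numpy as np

def classify_intervals(xs, ys):
    """Classify each interval (x_i, x_{i+1}) whose endpoints are both
    interior datapoints, using the sign rule from the abstract.

    Illustrative sketch only: the curvature estimate at an interior knot
    x_i is taken to be the sign of the change in secant slope across x_i
    (a second divided difference of the data).
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slopes = np.diff(ys) / np.diff(xs)      # secant slope on each (x_i, x_{i+1})
    curv = np.sign(np.diff(slopes))         # curv[j]: curvature sign at knot x_{j+1}
    labels = []
    for i in range(1, len(xs) - 2):         # intervals with two interior endpoints
        left, right = curv[i - 1], curv[i]  # curvature signs at x_i and x_{i+1}
        if left > 0 and right > 0:
            labels.append("convex")
        elif left < 0 and right < 0:
            labels.append("concave")
        else:                               # differing (or vanishing) signs => linear
            labels.append("linear")
    return labels

# Example: data sampled from f(x) = x^2, so every curvature estimate is positive.
print(classify_intervals([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
# -> ['convex', 'convex']
```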
Related papers
- The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
We show that this problem has randomized communication complexity $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$.
As an application, we obtain an $\Omega(\frac{1}{k}\cdot n^2\log|\mathbb{F}|)$ space lower bound for any streaming algorithm with $k$ passes.
arXiv Detail & Related papers (2024-10-26T06:21:42Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p}n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Interplay between depth and width for interpolation in neural ODEs [0.0]
We examine the interplay between the width $p$ and the number of layer transitions $L$ of neural ODEs.
In the high-dimensional setting, we demonstrate that $p=O(N)$ neurons are likely sufficient to achieve exact control.
arXiv Detail & Related papers (2024-01-18T11:32:50Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - $\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $\ell_p$-regression problem in the coordinator model.
For $p = 2$, i.e., least squares regression, we give the first optimal bound of $\tilde{\Theta}(sd^2 + sd/\epsilon)$ bits.
For $p \in (1,2)$, we obtain an $\tilde{O}(sd^2/\epsilon + sd/\mathrm{poly}(\epsilon))$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z) - Optimal Sketching Bounds for Sparse Linear Regression [116.30196615349226]
We study oblivious sketching for $k$-sparse linear regression under various loss functions, such as an $\ell_p$ norm or a broad class of hinge-like loss functions.
We show that for sparse $\ell_2$ regression, there is an oblivious distribution over sketches with $\Theta(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor.
We also show that sketching $O(\mu^2 k\log(\mu n d/\varepsilon)/\varepsilon^2)$ ...
arXiv Detail & Related papers (2023-04-05T07:24:19Z) - Learning linear dynamical systems under convex constraints [4.4351901934764975]
We consider the problem of identification of linear dynamical systems from $T$ samples of a single trajectory.
$A^*$ can be reliably estimated for values of $T$ smaller than what is needed in the unconstrained setting.
arXiv Detail & Related papers (2023-03-27T11:49:40Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound $O(d^{\max\{1,p/2\}}\log^2 n)$ for loss functions with at most degree-$p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Minimax Optimal Regression over Sobolev Spaces via Laplacian
Regularization on Neighborhood Graphs [25.597646488273558]
We study the statistical properties of Laplacian smoothing, a graph-based approach to nonparametric regression.
We prove that Laplacian smoothing is manifold-adaptive.
arXiv Detail & Related papers (2021-06-03T01:20:41Z) - The planted matching problem: Sharp threshold and infinite-order phase
transition [25.41713098167692]
We study the problem of reconstructing a perfect matching $M^*$ hidden in a randomly weighted $n\times n$ bipartite graph.
We show that if $\sqrt{d}\,B(\mathcal{P},\mathcal{Q}) \ge 1+\epsilon$ for an arbitrarily small constant $\epsilon>0$, the reconstruction error of any estimator is bounded away from $0$.
arXiv Detail & Related papers (2021-03-17T00:59:33Z) - Robust testing of low-dimensional functions [8.927163098772589]
A recent work of the authors showed that linear $k$-juntas are testable.
Following the surge of interest in noise-tolerant property testing, in this paper we prove a noise-tolerant (or robust) version of this result.
We obtain a fully noise-tolerant tester with query complexity $k^{O(\mathsf{poly}(\log k/\epsilon))}$ for the class of intersections of $k$-halfspaces.
arXiv Detail & Related papers (2020-04-24T10:23:12Z)