Tractability from overparametrization: The example of the negative
perceptron
- URL: http://arxiv.org/abs/2110.15824v3
- Date: Tue, 4 Jul 2023 03:33:28 GMT
- Title: Tractability from overparametrization: The example of the negative
perceptron
- Authors: Andrea Montanari, Yiqiao Zhong, Kangjie Zhou
- Abstract summary: We analyze a linear programming algorithm to find a solution and characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$.
We observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$ and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the question of the behavior of other algorithms.
- Score: 9.077560551700163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the negative perceptron problem we are given $n$ data points
$({\boldsymbol x}_i,y_i)$, where ${\boldsymbol x}_i$ is a $d$-dimensional
vector and $y_i\in\{+1,-1\}$ is a binary label. The data are not linearly
separable and hence we content ourselves with finding a linear classifier with the
largest possible \emph{negative} margin. In other words, we want to find a unit
norm vector ${\boldsymbol \theta}$ that maximizes $\min_{i\le n}y_i\langle
{\boldsymbol \theta},{\boldsymbol x}_i\rangle$. This is a non-convex
optimization problem (it is equivalent to finding a maximum norm vector in a
polytope), and we study its typical properties under two random models for the
data.
We consider the proportional asymptotics in which $n,d\to \infty$ with
$n/d\to\delta$, and prove upper and lower bounds on the maximum margin
$\kappa_{\text{s}}(\delta)$ or -- equivalently -- on its inverse function
$\delta_{\text{s}}(\kappa)$. In other words, $\delta_{\text{s}}(\kappa)$ is the
overparametrization threshold: for $n/d\le
\delta_{\text{s}}(\kappa)-\varepsilon$ a classifier achieving vanishing
training error exists with high probability, while for $n/d\ge
\delta_{\text{s}}(\kappa)+\varepsilon$ it does not. Our bounds on
$\delta_{\text{s}}(\kappa)$ match to the leading order as $\kappa\to -\infty$.
We then analyze a linear programming algorithm to find a solution, and
characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$. We
observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$
and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the
question of the behavior of other algorithms.
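Stated concretely, the threshold dichotomy described above can be written as follows; this display is an editorial paraphrase of the abstract (under the paper's random data models), not a verbatim statement from the paper:
\[
\mathbb{P}\Big(\max_{\|{\boldsymbol \theta}\|_2=1}\,\min_{i\le n} y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle \ge \kappa\Big) \;\to\;
\begin{cases}
1 & \text{if } \delta \le \delta_{\text{s}}(\kappa)-\varepsilon,\\
0 & \text{if } \delta \ge \delta_{\text{s}}(\kappa)+\varepsilon,
\end{cases}
\qquad \text{as } n,d\to\infty,\; n/d\to\delta.
\]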
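To make the objective and the linear-programming route concrete, here is a minimal numerical sketch. The isotropic Gaussian data model, the problem sizes, and the particular LP heuristic (maximizing a random linear functional over the polytope $\{{\boldsymbol \theta}: y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle \ge -1\}$ and normalizing) are illustrative assumptions made here; this is not necessarily the exact random model or LP algorithm analyzed in the paper.

```python
# Illustrative sketch of the negative-perceptron objective and a simple LP heuristic.
# Assumptions (not from the paper): isotropic Gaussian features, labels independent
# of the features, and an LP that maximizes a random linear functional over the
# polytope {theta : y_i <theta, x_i> >= -1}.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, delta = 200, 3.0
n = int(delta * d)                          # proportional regime: n/d -> delta
X = rng.standard_normal((n, d))             # rows are the feature vectors x_i
y = rng.choice([-1.0, 1.0], size=n)         # labels drawn independently of X

def min_margin(theta):
    """Objective min_i y_i <theta, x_i> for the unit-norm rescaling of theta."""
    theta = theta / np.linalg.norm(theta)
    return float(np.min(y * (X @ theta)))

# LP heuristic: a margin kappa < 0 is achievable iff the polytope
#   P = {theta : y_i <theta, x_i> >= -1 for all i}
# contains a point of Euclidean norm >= 1/|kappa|. We look for a "long" point of P
# by maximizing a random linear functional <u, theta> over P, then normalize.
u = rng.standard_normal(d)
res = linprog(
    c=-u,                                   # linprog minimizes, so maximize <u, theta>
    A_ub=-(y[:, None] * X),                 # encodes y_i <theta, x_i> >= -1
    b_ub=np.ones(n),
    bounds=[(None, None)] * d,              # theta is otherwise unconstrained
    method="highs",
)
assert res.status == 0, res.message         # P is bounded w.h.p. for n/d > 2
theta_lp = res.x

print(f"margin of a random unit vector: {min_margin(rng.standard_normal(d)):.3f}")
print(f"margin of the LP-based vector : {min_margin(theta_lp):.3f}")
```

Since every feasible point of the polytope satisfies $y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle \ge -1$, the normalized LP solution is guaranteed a margin of at least $-1/\|{\boldsymbol \theta}_{\mathrm{lp}}\|_2$, so longer LP solutions translate into less negative margins.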
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - $\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $\ell_p$-regression problem in the coordinator model.
For $p = 2$, i.e., least squares regression, we give the first optimal bound of $\tilde{\Theta}(sd^2 + sd/\epsilon)$ bits.
For $p \in (1,2)$, we obtain an $\tilde{O}(sd^2/\epsilon + sd/\mathrm{poly}(\epsilon))$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z) - A spectral least-squares-type method for heavy-tailed corrupted
regression with unknown covariance & heterogeneous noise [2.019622939313173]
We revisit heavy-tailed corrupted least-squares linear regression, assuming a corrupted label-feature sample of size $n$ with at most $\epsilon n$ arbitrary outliers.
We propose a near-optimal computationally tractable estimator, based on the power method, assuming no knowledge of $(\Sigma,\Xi)$ nor of the operator norm of $\Xi$.
arXiv Detail & Related papers (2022-09-06T23:37:31Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Self-training Converts Weak Learners to Strong Learners in Mixture
Models [86.7137362125503]
We show that a pseudolabeler ${\boldsymbol \beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$.
We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler ${\boldsymbol \beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples.
arXiv Detail & Related papers (2021-06-25T17:59:16Z) - Infinite-Horizon Offline Reinforcement Learning with Linear Function
Approximation: Curse of Dimensionality and Algorithm [46.36534144138337]
In this paper, we investigate the sample complexity of policy evaluation in offline reinforcement learning.
Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{\frac{\left\Vert \theta^{\pi}\right\Vert_2^4}{\varepsilon^4}\log\frac{d}{\delta},\, \frac{1}{\varepsilon^2}\left(d+\log\frac{1}{\delta}\right)\right\}\right)$ samples to approximate the value of the policy $\pi$.
arXiv Detail & Related papers (2021-03-17T18:18:57Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - On Efficient Low Distortion Ultrametric Embedding [18.227854382422112]
A widely-used method to preserve the underlying hierarchical structure of the data is to find an embedding of the data into a tree or an ultrametric.
In this paper, we provide a new algorithm which takes as input a set of points in $\mathbb{R}^d$ and (for a universal constant $\rho>1$) outputs an ultrametric $\Delta$.
We show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
arXiv Detail & Related papers (2020-08-15T11:06:45Z) - Fast Differentiable Clipping-Aware Normalization and Rescaling [22.320256458354137]
We show that the optimal rescaling can be found analytically using a fast and differentiable algorithm.
Our algorithm works for any p-norm and can be used to train neural networks on inputs normalized to perturbations.
arXiv Detail & Related papers (2020-07-15T13:43:22Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.