Tractability from overparametrization: The example of the negative
perceptron
- URL: http://arxiv.org/abs/2110.15824v3
- Date: Tue, 4 Jul 2023 03:33:28 GMT
- Title: Tractability from overparametrization: The example of the negative
perceptron
- Authors: Andrea Montanari, Yiqiao Zhong, Kangjie Zhou
- Abstract summary: We analyze a linear programming algorithm to find a solution and characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$.
We observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$ and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the question of the behavior of other algorithms.
- Score: 9.077560551700163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the negative perceptron problem we are given $n$ data points
$({\boldsymbol x}_i,y_i)$, where ${\boldsymbol x}_i$ is a $d$-dimensional
vector and $y_i\in\{+1,-1\}$ is a binary label. The data are not linearly
separable and hence we content ourselves with finding a linear classifier with the
largest possible \emph{negative} margin. In other words, we want to find a unit
norm vector ${\boldsymbol \theta}$ that maximizes $\min_{i\le n}y_i\langle
{\boldsymbol \theta},{\boldsymbol x}_i\rangle$. This is a non-convex
optimization problem (it is equivalent to finding a maximum norm vector in a
polytope), and we study its typical properties under two random models for the
data.
We consider the proportional asymptotics in which $n,d\to \infty$ with
$n/d\to\delta$, and prove upper and lower bounds on the maximum margin
$\kappa_{\text{s}}(\delta)$ or -- equivalently -- on its inverse function
$\delta_{\text{s}}(\kappa)$. In other words, $\delta_{\text{s}}(\kappa)$ is the
overparametrization threshold: for $n/d\le
\delta_{\text{s}}(\kappa)-\varepsilon$ a classifier achieving vanishing
training error exists with high probability, while for $n/d\ge
\delta_{\text{s}}(\kappa)+\varepsilon$ it does not. Our bounds on
$\delta_{\text{s}}(\kappa)$ match to the leading order as $\kappa\to -\infty$.
We then analyze a linear programming algorithm to find a solution, and
characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$. We
observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$
and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the
question of the behavior of other algorithms.
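Stated concretely, the threshold dichotomy described above can be written as follows; this display is an editorial paraphrase of the abstract (under the paper's random data models), not a verbatim statement from the paper:
\[
\mathbb{P}\Big(\max_{\|{\boldsymbol \theta}\|_2=1}\,\min_{i\le n} y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle \ge \kappa\Big) \;\to\;
\begin{cases}
1 & \text{if } \delta \le \delta_{\text{s}}(\kappa)-\varepsilon,\\
0 & \text{if } \delta \ge \delta_{\text{s}}(\kappa)+\varepsilon,
\end{cases}
\qquad \text{as } n,d\to\infty,\; n/d\to\delta.
\]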
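To make the objective and the linear-programming route concrete, here is a minimal numerical sketch. The isotropic Gaussian data model, the problem sizes, and the particular LP heuristic (maximizing a random linear functional over the polytope $\{{\boldsymbol \theta}: y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle \ge -1\}$ and normalizing) are illustrative assumptions made here; this is not necessarily the exact random model or LP algorithm analyzed in the paper.

```python
# Illustrative sketch of the negative-perceptron objective and a simple LP heuristic.
# Assumptions (not from the paper): isotropic Gaussian features, labels independent
# of the features, and an LP that maximizes a random linear functional over the
# polytope {theta : y_i <theta, x_i> >= -1}.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, delta = 200, 3.0
n = int(delta * d)                          # proportional regime: n/d -> delta
X = rng.standard_normal((n, d))             # rows are the feature vectors x_i
y = rng.choice([-1.0, 1.0], size=n)         # labels drawn independently of X

def min_margin(theta):
    """Objective min_i y_i <theta, x_i> for the unit-norm rescaling of theta."""
    theta = theta / np.linalg.norm(theta)
    return float(np.min(y * (X @ theta)))

# LP heuristic: a margin kappa < 0 is achievable iff the polytope
#   P = {theta : y_i <theta, x_i> >= -1 for all i}
# contains a point of Euclidean norm >= 1/|kappa|. We look for a "long" point of P
# by maximizing a random linear functional <u, theta> over P, then normalize.
u = rng.standard_normal(d)
res = linprog(
    c=-u,                                   # linprog minimizes, so maximize <u, theta>
    A_ub=-(y[:, None] * X),                 # encodes y_i <theta, x_i> >= -1
    b_ub=np.ones(n),
    bounds=[(None, None)] * d,              # theta is otherwise unconstrained
    method="highs",
)
assert res.status == 0, res.message         # P is bounded w.h.p. for n/d > 2
theta_lp = res.x

print(f"margin of a random unit vector: {min_margin(rng.standard_normal(d)):.3f}")
print(f"margin of the LP-based vector : {min_margin(theta_lp):.3f}")
```

Since every feasible point of the polytope satisfies $y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle \ge -1$, the normalized LP solution is guaranteed a margin of at least $-1/\|{\boldsymbol \theta}_{\mathrm{lp}}\|_2$, so longer LP solutions translate into less negative margins.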
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - $\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $\ell_p$-regression problem in the coordinator model.
For $p = 2$, i.e., least squares regression, we give the first optimal bound of $\tilde{\Theta}(sd^2 + sd/\epsilon)$ bits.
For $p \in (1,2)$, we obtain an $\tilde{O}(sd^2/\epsilon + sd/\mathrm{poly}(\epsilon))$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z) - A spectral least-squares-type method for heavy-tailed corrupted
regression with unknown covariance & heterogeneous noise [2.019622939313173]
We revisit heavy-tailed corrupted least-squares linear regression, assuming a corrupted label-feature sample of size $n$ with at most $\epsilon n$ arbitrary outliers.
We propose a near-optimal computationally tractable estimator, based on the power method, assuming no knowledge of $(\Sigma,\Xi)$ nor of the operator norm of $\Xi$.
arXiv Detail & Related papers (2022-09-06T23:37:31Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Self-training Converts Weak Learners to Strong Learners in Mixture
Models [86.7137362125503]
We show that a pseudolabeler ${\boldsymbol \beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$.
We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler ${\boldsymbol \beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples.
arXiv Detail & Related papers (2021-06-25T17:59:16Z) - Infinite-Horizon Offline Reinforcement Learning with Linear Function
Approximation: Curse of Dimensionality and Algorithm [46.36534144138337]
In this paper, we investigate the sample complexity of policy evaluation in offline reinforcement learning.
Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{\frac{\left\Vert \theta^{\pi}\right\Vert_2^4}{\varepsilon^4}\log\frac{d}{\delta},\, \frac{1}{\varepsilon^2}\left(d+\log\frac{1}{\delta}\right)\right\}\right)$ samples to approximate the value of the policy $\pi$.
arXiv Detail & Related papers (2021-03-17T18:18:57Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - On Efficient Low Distortion Ultrametric Embedding [18.227854382422112]
A widely-used method to preserve the underlying hierarchical structure of the data is to find an embedding of the data into a tree or an ultrametric.
In this paper, we provide a new algorithm which takes as input a set of points in $\mathbb{R}^d$ and (for a universal constant $\rho>1$) outputs an ultrametric $\Delta$.
We show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
arXiv Detail & Related papers (2020-08-15T11:06:45Z) - Fast Differentiable Clipping-Aware Normalization and Rescaling [22.320256458354137]
We show that the optimal rescaling can be found analytically using a fast and differentiable algorithm.
Our algorithm works for any p-norm and can be used to train neural networks on inputs normalized to perturbations.
arXiv Detail & Related papers (2020-07-15T13:43:22Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.