Gradient-Based Empirical Risk Minimization using Local Polynomial
Regression
- URL: http://arxiv.org/abs/2011.02522v1
- Date: Wed, 4 Nov 2020 20:10:31 GMT
- Title: Gradient-Based Empirical Risk Minimization using Local Polynomial
Regression
- Authors: Ali Jadbabaie and Anuran Makur and Devavrat Shah
- Abstract summary: A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD).
We demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD.
- Score: 39.29885444997579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the problem of empirical risk minimization (ERM)
of smooth, strongly convex loss functions using iterative gradient-based
methods. A major goal of this literature has been to compare different
algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD),
by analyzing their rates of convergence to $\epsilon$-approximate solutions.
For example, the oracle complexity of GD is $O(n\log(\epsilon^{-1}))$, where
$n$ is the number of training samples. When $n$ is large, this can be expensive
in practice, and SGD is preferred due to its oracle complexity of
$O(\epsilon^{-1})$. Such standard analyses only utilize the smoothness of the
loss function in the parameter being optimized. In contrast, we demonstrate
that when the loss function is smooth in the data, we can learn the oracle at
every iteration and beat the oracle complexities of both GD and SGD in
important regimes. Specifically, at every iteration, our proposed algorithm
performs local polynomial regression to learn the gradient of the loss
function, and then estimates the true gradient of the ERM objective function.
We establish that the oracle complexity of our algorithm scales like
$\tilde{O}((p \epsilon^{-1})^{d/(2\eta)})$ (neglecting sub-dominant factors),
where $d$ and $p$ are the data and parameter space dimensions, respectively,
and the gradient of the loss function belongs to an $\eta$-H\"{o}lder class with
respect to the data. Our proof extends the analysis of local polynomial
regression in non-parametric statistics to provide interpolation guarantees in
multivariate settings, and also exploits tools from the inexact GD literature.
Unlike GD and SGD, the complexity of our method depends on $d$ and $p$.
However, when $d$ is small and the loss function exhibits modest smoothness in
the data, our algorithm beats GD and SGD in oracle complexity for a very broad
range of $p$ and $\epsilon$.
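The abstract describes the algorithm only at a high level: at each iteration, call the exact per-sample gradient oracle at a small set of data points, fit a local polynomial regression in the data variable to those gradients, and use the fitted surrogate to approximate the full ERM gradient for an inexact GD step. The sketch below is a minimal Python/NumPy illustration of that loop under simplifying assumptions (degree-1 local polynomials, uniformly sampled probe points, a fixed bandwidth). The helper names local_linear_gradient_surrogate and lpi_gd are ours for illustration, and the sketch does not attempt to reproduce the paper's choice of interpolation points or its oracle-complexity guarantees.

```python
import numpy as np


def local_linear_gradient_surrogate(X_probe, G_probe, X_query, bandwidth):
    """Degree-1 local polynomial (local linear) regression of per-sample
    gradients over the data space: for each query point, fit a kernel-weighted
    linear model to the probed gradients and return its value at that point.

    X_probe : (m, d) data points at which the exact gradient oracle was called
    G_probe : (m, p) exact per-sample gradients at those probe points
    X_query : (n, d) data points at which the gradient is to be predicted
    returns : (n, p) predicted per-sample gradients
    """
    m, _ = X_probe.shape
    preds = np.empty((X_query.shape[0], G_probe.shape[1]))
    for j, x in enumerate(X_query):
        diff = X_probe - x                                   # (m, d)
        u = np.linalg.norm(diff, axis=1) / bandwidth
        w = np.maximum(1.0 - u**2, 0.0)                      # Epanechnikov-style weights
        if w.sum() <= 1e-12:                                 # no probe point nearby:
            preds[j] = G_probe[np.argmin(u)]                 # fall back to nearest neighbour
            continue
        A = np.hstack([np.ones((m, 1)), diff])               # local linear design matrix
        Aw = A * w[:, None]
        coef, *_ = np.linalg.lstsq(Aw.T @ A, Aw.T @ G_probe, rcond=None)
        preds[j] = coef[0]                                   # intercept = fitted value at x
    return preds


def lpi_gd(X, per_sample_grad, theta0, step_size, n_iters, n_probe, bandwidth, seed=0):
    """Inexact gradient descent driven by the surrogate above: per iteration,
    the exact oracle is called only on a small probe subset of the data, the
    surrogate predicts per-sample gradients for all n samples, and their mean
    is used as an approximate ERM gradient."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    n = X.shape[0]
    for _ in range(n_iters):
        probe_idx = rng.choice(n, size=min(n_probe, n), replace=False)
        X_probe = X[probe_idx]
        G_probe = np.stack([per_sample_grad(theta, x) for x in X_probe])  # oracle calls
        G_hat = local_linear_gradient_surrogate(X_probe, G_probe, X, bandwidth)
        theta -= step_size * G_hat.mean(axis=0)              # inexact GD step on the ERM objective
    return theta
```

A toy usage, treating each row of Z as a data point $(x, y)$ for the squared loss; it is only a smoke test of the sketch, not a reproduction of the paper's experiments:

```python
# Toy usage: least-squares ERM with per-sample loss 0.5 * (<x, theta> - y)^2.
rng = np.random.default_rng(1)
X_feat = rng.normal(size=(500, 3))
y = X_feat @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
Z = np.hstack([X_feat, y[:, None]])                     # data point z = (x, y)
grad = lambda theta, z: (z[:3] @ theta - z[3]) * z[:3]  # per-sample gradient in theta
theta_hat = lpi_gd(Z, grad, theta0=np.zeros(3), step_size=0.5,
                   n_iters=100, n_probe=40, bandwidth=3.0)
```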
Related papers
- MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning.
Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions.
We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Gradient Descent for Low-Rank Functions [36.56489593549855]
In machine learning tasks, e.g., training deep neural networks, the loss function varies significantly in only a few directions of the input.
Our proposed LowRank Gradient Descent finds an $\epsilon$-approximate solution by first identifying these significant directions, using $\mathcal{O}(p\log(1/\epsilon))$ and $\mathcal{O}(p/\epsilon^2)$ gradient computations.
arXiv Detail & Related papers (2022-06-16T15:58:05Z) - Optimal Gradient Sliding and its Application to Distributed Optimization
Under Similarity [121.83085611327654]
We study structured convex optimization problems with an additive objective $r := p + q$, where $r$ is $\mu$-strongly convex, under a similarity condition.
We propose a method to solve such problems that is efficient in both master-to-agent communication and local gradient calls.
The proposed method is much sharper than the $\mathcal{O}(\sqrt{L_q/\mu})$ bound of prior methods.
arXiv Detail & Related papers (2022-05-30T14:28:02Z) - On Acceleration of Gradient-Based Empirical Risk Minimization using
Local Polynomial Regression [0.491574468325115]
We study the acceleration of the Local Polynomial Interpolation-based Gradient Descent method (LPI-GD), recently proposed for the approximate solution of empirical risk minimization (ERM) problems.
We focus on loss functions that are strongly convex and smooth with condition number $\sigma$.
We propose two accelerated methods for the problem based on LPI-GD and show an oracle complexity of $\tilde{O}\left(\sqrt{\sigma}\, m d \log(1/\varepsilon)\right)$ (this bound is collected with the others quoted on this page in the display after this list).
arXiv Detail & Related papers (2022-04-16T02:39:45Z) - Federated Optimization of Smooth Loss Functions [35.944772011421264]
In this work, we study empirical risk minimization (ERM) within a federated learning framework.
We propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm.
Our method solves the ERM problem at the server using inexact gradient descent.
arXiv Detail & Related papers (2022-01-06T07:40:51Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an optimization error on the order of the excess error of the learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
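For quick reference, the oracle-complexity bounds quoted on this page (in the abstract above and in the LPI-GD acceleration entry, as noted there) can be collected in one display. Each symbol is used as defined where its bound is stated; in particular, $m$ belongs to the accelerated method of that related paper and is not defined on this page, so this is a side-by-side restatement rather than a like-for-like comparison.

```latex
\begin{align*}
  \text{GD:}\quad & O\!\left(n \log(\epsilon^{-1})\right), &
  \text{SGD:}\quad & O\!\left(\epsilon^{-1}\right), \\
  \text{LPI-GD (this paper):}\quad & \tilde{O}\!\left((p\,\epsilon^{-1})^{d/(2\eta)}\right), &
  \text{accelerated LPI-GD:}\quad & \tilde{O}\!\left(\sqrt{\sigma}\, m d \log(1/\varepsilon)\right).
\end{align*}
```

Read against the abstract, LPI-GD is attractive precisely when $d$ is small and the data-smoothness $\eta$ is large enough that $(p\,\epsilon^{-1})^{d/(2\eta)}$ grows much more slowly than $n$.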