On Acceleration of Gradient-Based Empirical Risk Minimization using
Local Polynomial Regression
- URL: http://arxiv.org/abs/2204.07702v1
- Date: Sat, 16 Apr 2022 02:39:45 GMT
- Title: On Acceleration of Gradient-Based Empirical Risk Minimization using
Local Polynomial Regression
- Authors: Ekaterina Trimbach, Edward Duc Hien Nguyen, and C\'esar A. Uribe
- Abstract summary: We study the acceleration of the Local Polynomial Interpolation-based Gradient Descent method (LPI-GD) recently proposed for the approximate solution of empirical risk minimization problems (ERM).
We focus on loss functions that are strongly convex and smooth with condition number $\sigma$.
We propose two accelerated methods for the problem based on LPI-GD and show an oracle complexity of $\tilde{O}\left(\sqrt{\sigma} m^d \log(1/\varepsilon)\right)$.
- Score: 0.491574468325115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the acceleration of the Local Polynomial Interpolation-based
Gradient Descent method (LPI-GD) recently proposed for the approximate solution
of empirical risk minimization problems (ERM). We focus on loss functions that
are strongly convex and smooth with condition number $\sigma$. We additionally
assume the loss function is $\eta$-H\"older continuous with respect to the
data. The oracle complexity of LPI-GD is $\tilde{O}\left(\sigma m^d
\log(1/\varepsilon)\right)$ for a desired accuracy $\varepsilon$, where $d$ is
the dimension of the parameter space, and $m$ is the cardinality of an
approximation grid. The factor $m^d$ can be shown to scale as
$O((1/\varepsilon)^{d/2\eta})$. LPI-GD has been shown to have better oracle
complexity than gradient descent (GD) and stochastic gradient descent (SGD) for
certain parameter regimes. We propose two accelerated methods for the ERM
problem based on LPI-GD and show an oracle complexity of
$\tilde{O}\left(\sqrt{\sigma} m^d \log(1/\varepsilon)\right)$. Moreover, we
provide the first empirical study on local polynomial interpolation-based
gradient methods and corroborate that LPI-GD has better performance than GD and
SGD in some scenarios, and the proposed methods achieve acceleration.
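To make the role of the condition number concrete, below is a minimal sketch (not the authors' exact algorithm) of Nesterov's accelerated method driven by an approximate gradient oracle. In LPI-GD that oracle is built by local polynomial interpolation over the $m^d$-point grid; here `approx_grad` is a hypothetical stand-in, and the $\sqrt{\sigma}$ factor in the accelerated bound reflects the $O(\sqrt{\sigma}\log(1/\varepsilon))$ iteration count of this scheme.

```python
import numpy as np

def accelerated_gd(approx_grad, x0, L, mu, n_iters):
    """Nesterov's accelerated gradient method for an L-smooth, mu-strongly
    convex objective, driven by an approximate gradient oracle. With
    condition number sigma = L / mu, reaching accuracy eps takes
    O(sqrt(sigma) * log(1/eps)) iterations, versus O(sigma * log(1/eps))
    for plain gradient descent.
    """
    x, y = x0.copy(), x0.copy()
    kappa = L / mu
    momentum = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    for _ in range(n_iters):
        x_next = y - (1.0 / L) * approx_grad(y)  # gradient step at extrapolated point
        y = x_next + momentum * (x_next - x)     # momentum extrapolation
        x = x_next
    return x

# Hypothetical usage on a strongly convex quadratic 0.5 * x^T A x:
A = np.diag([1.0, 10.0])            # L = 10, mu = 1, so sigma = 10
grad = lambda x: A @ x              # stand-in for the interpolated oracle
x_min = accelerated_gd(grad, np.ones(2), L=10.0, mu=1.0, n_iters=100)
```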
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2023-11-22T15:34:44Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}\left(H \varepsilon^{-3} \ln \frac{1}{\delta}\right)$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum
Minimization [52.25843977506935]
We propose an adaptive variance-reduction method, called AdaSpider, for $L$-smooth, non-convex functions with a finite-sum structure.
In doing so, we are able to compute an $\epsilon$-stationary point with $\tilde{O}\left(n + \sqrt{n}/\epsilon^2\right)$ calls to the gradient oracle.
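For context, here is a minimal sketch of the SPIDER-style variance-reduced gradient estimator that this family of methods builds on; the paper's adaptive step sizes are not reproduced, and `grad_i` is a hypothetical per-term gradient oracle for a finite sum of $n$ terms.

```python
import numpy as np

def spider_descent(grad_i, n, x0, step, epoch_len, n_iters, batch=32, seed=0):
    """Generic SPIDER-style variance reduction for finite sums: keep a
    running gradient estimate v, refresh it with a full gradient every
    epoch_len steps, and otherwise update it with minibatch gradient
    differences between consecutive iterates. AdaSpider replaces the
    fixed step size with an adaptive one (not reproduced here).
    """
    rng = np.random.default_rng(seed)
    full_grad = lambda x: np.mean([grad_i(x, i) for i in range(n)], axis=0)
    x_prev = x0.copy()
    x = x0.copy()
    v = full_grad(x)
    for t in range(n_iters):
        if t % epoch_len == 0:
            v = full_grad(x)                      # periodic full refresh
        else:
            idx = rng.integers(0, n, size=batch)  # minibatch difference update
            v = v + np.mean([grad_i(x, i) - grad_i(x_prev, i) for i in idx], axis=0)
        x_prev, x = x, x - step * v
    return x
```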
arXiv Detail & Related papers (2022-11-03T14:41:46Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes (MDPs) in the fixed-confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Gradient Descent for Low-Rank Functions [36.56489593549855]
In machine learning tasks, e.g., training deep neural networks, the loss function varies significantly in only a few directions of the input.
Our proposed \emph{LowRank Gradient Descent} finds an $\epsilon$-approximate stationary point of a $p$-dimensional function by first identifying its significant directions, compared with the $\mathcal{O}(p\log(1/\epsilon))$ gradient evaluations of GD and the $\mathcal{O}(p/\epsilon^2)$ of SGD, as sketched below.
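As a rough illustration of the low-rank idea (a sketch that assumes the significant directions are already known as the orthonormal columns of a matrix `U`; the paper's procedure for identifying them is not reproduced), the gradient can be estimated from $r$ directional derivatives instead of all $p$ coordinates:

```python
import numpy as np

def low_rank_gd(f, x0, U, step, n_iters, h=1e-6):
    """Gradient descent with a low-rank gradient estimate: approximate
    grad f(x) by U @ (U^T grad f(x)), computing only r directional
    derivatives (by finite differences) along the columns of U per step,
    rather than a full p-dimensional gradient.
    """
    x = x0.copy()
    for _ in range(n_iters):
        # directional derivative of f along each significant direction
        coeffs = np.array([(f(x + h * u) - f(x)) / h for u in U.T])
        g = U @ coeffs          # low-rank gradient estimate in R^p
        x = x - step * g
    return x
```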
arXiv Detail & Related papers (2022-06-16T15:58:05Z) - A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm [3.2958527541557525]
Such problems arise frequently in machine learning in the context of robust empirical risk minimization.
We consider the stochastic accelerated primal-dual (SAPD) algorithm as a robust method against gradient noise.
We show that our method improves upon SAPD both in practice and in theory.
arXiv Detail & Related papers (2022-02-19T22:12:30Z) - Faster Perturbed Stochastic Gradient Methods for Finding Local Minima [92.99933928528797]
We propose \texttt{Pullback}, a faster perturbed stochastic gradient framework for finding local minima.
We show that Pullback with stochastic gradient estimators such as SARAH/SPIDER and STORM can find $(\epsilon, \epsilon_H)$-approximate local minima within $\tilde{O}(\epsilon^{-3} + \epsilon_H^{-6})$ stochastic gradient evaluations.
The core idea of our framework is a "step-size pullback" scheme to control the average movement of the gradient evaluations.
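For background, a minimal sketch of the classical perturbation idea that such frameworks refine: take gradient steps, and when the gradient is small (so the iterate may be near a saddle point) inject a small random kick. Pullback's step-size control and estimator choices are not reproduced here.

```python
import numpy as np

def perturbed_gd(grad, x0, step, n_iters, g_thresh=1e-3, radius=1e-2, seed=0):
    """Perturbed gradient descent: ordinary steps while the gradient is
    large; when it falls below g_thresh, add noise sampled uniformly
    from a ball of the given radius to help escape saddle points.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) > g_thresh:
            x = x - step * g
        else:
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)
            r = radius * rng.random() ** (1.0 / x.size)  # uniform in the ball
            x = x + r * direction
    return x
```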
arXiv Detail & Related papers (2021-10-25T07:20:05Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
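A minimal sketch of an exponential step-size schedule of the kind studied here (the paper's constants and noise-adaptive tuning are its contribution and are not reproduced; `stoch_grad` is a hypothetical stochastic gradient oracle):

```python
import numpy as np

def sgd_exponential_steps(stoch_grad, x0, eta0, alpha, T):
    """SGD with step size eta_t = eta0 * alpha**(t / T), alpha in (0, 1):
    the step decays smoothly from eta0 down to eta0 * alpha over the
    T iterations, large early for fast progress, small late to average
    out gradient noise.
    """
    x = x0.copy()
    for t in range(T):
        eta_t = eta0 * alpha ** (t / T)
        x = x - eta_t * stoch_grad(x)
    return x
```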
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - Gradient-Based Empirical Risk Minimization using Local Polynomial
Regression [39.29885444997579]
A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD).
We demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD.
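A toy, one-dimensional illustration of the interpolation trick (the paper fits local polynomials over a $d$-dimensional grid in data space; this piecewise-linear scalar version is only an analogy): exact per-sample gradients are evaluated at $m$ grid points, and the gradients at the $n$ data points are interpolated, so each iteration costs $m$ oracle calls instead of $n$.

```python
import numpy as np

def interpolated_full_gradient(grad_sample, w, data, m):
    """Estimate the full ERM gradient at parameters w, for scalar data
    in [0, 1], by evaluating the per-sample gradient grad_sample(w, z)
    only at m grid points and piecewise-linearly interpolating it at
    the actual data points. Valid when the gradient is smooth in z.
    """
    grid = np.linspace(0.0, 1.0, m)
    grid_grads = np.stack([grad_sample(w, z) for z in grid])  # m oracle calls
    # interpolate each gradient coordinate at the n data points
    est = np.stack([np.interp(data, grid, grid_grads[:, j])
                    for j in range(grid_grads.shape[1])], axis=1)
    return est.mean(axis=0)  # average over the n data points
```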
arXiv Detail & Related papers (2020-11-04T20:10:31Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
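For reference, the basic SGLD update the analysis concerns (a generic sketch; `grad_log_density` may itself be a stochastic minibatch estimate):

```python
import numpy as np

def sgld(grad_log_density, x0, step, n_iters, seed=0):
    """Stochastic gradient Langevin dynamics: a gradient-ascent step on
    the log density plus injected Gaussian noise of variance 2 * step,
    so the iterates approximately sample from the target distribution
    instead of collapsing onto a mode.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    samples = []
    for _ in range(n_iters):
        noise = rng.normal(size=x.shape)
        x = x + step * grad_log_density(x) + np.sqrt(2.0 * step) * noise
        samples.append(x.copy())
    return np.array(samples)
```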
arXiv Detail & Related papers (2020-10-19T15:23:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.