Backtracking Gradient Descent allowing unbounded learning rates
- URL: http://arxiv.org/abs/2001.02005v2
- Date: Wed, 8 Jan 2020 16:50:06 GMT
- Title: Backtracking Gradient Descent allowing unbounded learning rates
- Authors: Tuyen Trung Truong
- Abstract summary: In unconstrained optimisation on a Euclidean space, to prove convergence in Gradient Descent processes (GD) $x_{n+1}=x_n-\delta_n \nabla f(x_n)$ it is usually required that the learning rates $\delta_n$'s are bounded.
In this paper, we allow the learning rates $delta _n$ to be unbounded.
It will be shown that this growth rate of $h$ is best possible if one wants convergence of the sequence $x_n$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In unconstrained optimisation on a Euclidean space, to prove convergence in
Gradient Descent processes (GD) $x_{n+1}=x_n-\delta _n \nabla f(x_n)$ it
is usually required that the learning rates $\delta _n$'s are bounded: $\delta
_n\leq \delta $ for some positive $\delta $. Under this assumption, if the
sequence $x_n$ converges to a critical point $z$, then with large values of $n$
the update will be small because $||x_{n+1}-x_n||\lesssim ||\nabla f(x_n)||$.
This may also force the sequence to converge to a bad minimum. If we can allow,
at least theoretically, that the learning rates $\delta _n$'s are not bounded,
then we may have better convergence to better minima.
A previous joint paper by the author showed convergence for the usual version
of Backtracking GD under very general assumptions on the cost function $f$. In
this paper, we allow the learning rates $\delta _n$ to be unbounded, in the
sense that there is a function $h:(0,\infty)\rightarrow (0,\infty )$ such that
$\lim _{t\rightarrow 0}th(t)=0$ and $\delta _n\lesssim \max \{h(\|\nabla f(x_n)\|),\delta \}$
satisfies Armijo's condition for all $n$, and prove convergence under the same
assumptions as in the mentioned paper. It will be shown that this growth rate
of $h$ is best possible if one wants convergence of the sequence $\{x_n\}$.
A specific way for choosing $\delta _n$ in a discrete way connects to Two-way
Backtracking GD defined in the mentioned paper. We provide some results which
either improve or are implicitly contained in those in the mentioned paper and
another recent paper on avoidance of saddle points.
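To make the mechanism above concrete, here is a minimal sketch of backtracking GD whose line search is allowed to start from an unbounded learning rate. It assumes an illustrative growth function $h(t)=t^{-\gamma}$ with $0<\gamma<1$ (so $t\,h(t)=t^{1-\gamma}\to 0$ as $t\to 0$, as required of $h$ in the abstract) and caps the starting rate at $\max\{h(\|\nabla f(x_n)\|),\delta\}$; all function names and parameter values are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def unbounded_backtracking_gd(f, grad_f, x0, delta0=1.0, alpha=0.5, beta=0.5,
                              gamma=0.7, n_iters=1000, tol=1e-10):
    """Sketch of Backtracking GD with a possibly unbounded starting learning rate.

    Assumption (illustrative, not from the paper verbatim): the cap on the
    starting learning rate is max(h(||grad f(x)||), delta0) with
    h(t) = t**(-gamma), 0 < gamma < 1, so that t*h(t) = t**(1-gamma) -> 0
    as t -> 0, matching the growth condition lim_{t->0} t*h(t) = 0.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        # Start the line search from a learning rate that may blow up near critical points.
        h = gnorm ** (-gamma)          # h(t) = t^{-gamma} grows as the gradient shrinks
        delta = max(h, delta0)
        # Backtrack until Armijo's condition f(x - delta*g) <= f(x) - alpha*delta*||g||^2 holds.
        while f(x - delta * g) > f(x) - alpha * delta * gnorm ** 2:
            delta *= beta
        x = x - delta * g
    return x

# Toy usage on a smooth one-dimensional cost (illustrative only).
if __name__ == "__main__":
    f = lambda x: float(x[0] ** 4)
    grad_f = lambda x: np.array([4.0 * x[0] ** 3])
    print(unbounded_backtracking_gd(f, grad_f, x0=[2.0]))
```

Starting each line search from the previously accepted $\delta_n$, increasing it while Armijo's condition still holds and decreasing it otherwise, would give a discrete variant in the spirit of the Two-way Backtracking GD mentioned above.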
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Efficient Continual Finite-Sum Minimization [52.5238287567572]
We propose a key twist on finite-sum minimization, dubbed continual finite-sum minimization.
Our approach significantly improves upon the $\mathcal{O}(n/\epsilon)$ FOs that $\mathrm{StochasticGradientDescent}$ requires.
We also prove that there is no natural first-order method with $\mathcal{O}\left(n/\epsilon^{\alpha}\right)$ gradient complexity for $\alpha < 1/4$, establishing that the first-order complexity of our method is nearly tight.
arXiv Detail & Related papers (2024-06-07T08:26:31Z) - On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $\frac{1}{T}\sum_{k=1}^T \mathbb{E}\left[\|\nabla f(x_k)\|_1\right]\leq O\!\left(\frac{\sqrt{d}}{T^{1/4}}\right)$ measured by the $\ell_1$ norm.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^T \mathbb{E}\left[\|\nabla f(x_k)\|_2\right]\leq O\!\left(\frac{1}{T^{1/4}}\right)$ rate of SGD measured by the $\ell_2$ norm.
arXiv Detail & Related papers (2024-02-01T07:21:32Z) - A note on estimating the dimension from a random geometric graph [2.3020018305241337]
We study the problem of estimating the dimension $d$ of the underlying space when we have access to the adjacency matrix of the graph.
We also show that, without any condition on the density, a consistent estimator of $d$ exists when $n r_n^d \to \infty$ and $r_n = o(1)$.
arXiv Detail & Related papers (2023-11-21T23:46:44Z) - Generalisations and improvements of New Q-Newton's method Backtracking [0.0]
We propose a general framework for the algorithm New Q-Newton's method Backtracking.
In this paper, we allow more flexibility and generality; for example, $e_1(x),\ldots ,e_m(x)$'s are not necessarily eigenvectors of $\nabla ^2f(x)$.
arXiv Detail & Related papers (2021-09-23T14:28:15Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. the learner has access only to a zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - Asymptotic behaviour of learning rates in Armijo's condition [0.0]
We show that if $x_n$ converges to a non-degenerate critical point, then $\delta _n$ must be bounded.
This complements the first author's results on Unbounded Backtracking GD.
arXiv Detail & Related papers (2020-07-07T16:49:25Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - A fast and simple modification of Newton's method helping to avoid
saddle points [0.0]
This paper roughly says that if $f$ is $C^3$ and a sequence $x_n$ constructed by the modified method converges, then the limit point is a critical point and is not a saddle point, and the convergence rate is the same as that of Newton's method.
arXiv Detail & Related papers (2020-06-02T10:38:00Z) - On the Complexity of Minimizing Convex Finite Sums Without Using the
Indices of the Individual Functions [62.01594253618911]
We exploit the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model.
Following a similar approach, we propose a novel adaptation of SVRG which is both compatible with oracles and achieves complexity bounds of $\tilde{O}(n^2+n\sqrt{L/\mu})\log (1/\epsilon)$ and $O(n\sqrt{L/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively.
arXiv Detail & Related papers (2020-02-09T03:39:46Z) - Some convergent results for Backtracking Gradient Descent method on
Banach spaces [0.0]
Theorem. Let $X$ be a Banach space and $f:X\rightarrow \mathbb{R}$ be a $C^2$ function.
Let $\mathcal{C}$ be the set of critical points of $f$.
arXiv Detail & Related papers (2020-01-16T12:49:42Z)