Escaping Saddle Points Efficiently with Occupation-Time-Adapted
Perturbations
- URL: http://arxiv.org/abs/2005.04507v3
- Date: Sun, 27 Mar 2022 02:55:54 GMT
- Title: Escaping Saddle Points Efficiently with Occupation-Time-Adapted
Perturbations
- Authors: Xin Guo, Jiequn Han, Mahan Tajrobehkar, Wenpin Tang
- Abstract summary: Motivated by the super-diffusivity of self-repelling random walk, this paper develops a new perturbation mechanism for optimization algorithms.
Two new algorithms are proposed: perturbed gradient descent adapted to occupation time and its accelerated version.
- Score: 12.251606057991237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the super-diffusivity of self-repelling random walk, which has
roots in statistical physics, this paper develops a new perturbation mechanism
for optimization algorithms. In this mechanism, perturbations are adapted to
the history of states via the notion of occupation time. After integrating this
mechanism into the framework of perturbed gradient descent (PGD) and perturbed
accelerated gradient descent (PAGD), two new algorithms are proposed: perturbed
gradient descent adapted to occupation time (PGDOT) and its accelerated version
(PAGDOT). PGDOT and PAGDOT are shown to converge to second-order stationary
points at least as fast as PGD and PAGD, respectively, and thus they are
guaranteed to avoid getting stuck at non-degenerate saddle points. The
theoretical analysis is corroborated by empirical studies in which the new
algorithms consistently escape saddle points and outperform not only their
counterparts, PGD and PAGD, but also other popular alternatives including
stochastic gradient descent, Adam, AMSGrad, and RMSProp.
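The abstract specifies the perturbation mechanism only at a high level. As a rough illustration, the following Python sketch runs a perturbed-gradient-descent loop whose perturbation, triggered when the gradient is small, is biased away from frequently visited regions via an occupation-time record; the binning scheme, the candidate-sampling heuristic, and all thresholds are assumptions made for illustration, not the exact PGDOT construction analyzed in the paper.

```python
# Minimal sketch of occupation-time-biased perturbed gradient descent
# (illustrative only; not the paper's exact PGDOT algorithm).
import numpy as np

def pgdot_sketch(grad_f, x0, lr=0.01, g_thresh=1e-3, radius=0.1,
                 bin_width=0.05, n_iters=5_000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    occupation = {}  # occupation time (visit count) per spatial bin

    for _ in range(n_iters):
        key = tuple(np.floor(x / bin_width).astype(int))
        occupation[key] = occupation.get(key, 0) + 1

        g = grad_f(x)
        if np.linalg.norm(g) <= g_thresh:
            # Near a stationary point: draw candidate perturbations from a
            # box around x and keep the one landing in the least-visited bin
            # (a self-repelling heuristic standing in for occupation-time
            # adaptation).
            candidates = x + radius * rng.uniform(-1.0, 1.0, size=(10, x.size))
            visits = [occupation.get(tuple(np.floor(c / bin_width).astype(int)), 0)
                      for c in candidates]
            x = candidates[int(np.argmin(visits))]
        else:
            x = x - lr * g
    return x

# Toy usage: f(x, y) = x^2 - y^2 + y^4 has a strict saddle at the origin.
grad = lambda z: np.array([2 * z[0], -2 * z[1] + 4 * z[1] ** 3])
print(pgdot_sketch(grad, x0=np.zeros(2)))  # ends up near a local minimum
```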
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Rethinking PGD Attack: Is Sign Function Necessary? [131.6894310945647]
We present a theoretical analysis of how sign-based update algorithms influence step-wise attack performance.
We propose a new raw gradient descent (RGD) algorithm that eliminates the use of the sign function.
The effectiveness of the proposed RGD algorithm has been demonstrated in extensive experiments.
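As a rough illustration of the distinction drawn above, the following Python sketch contrasts a sign-based (PGD-style) attack step with a raw-gradient step; the max-norm rescaling and the L-infinity projection are illustrative assumptions, not necessarily the paper's exact RGD update.

```python
# Hypothetical sketch: sign-based (PGD-style) vs. raw-gradient attack steps.
import numpy as np

def pgd_step(x, grad, alpha):
    # Classic PGD uses only the sign of the gradient.
    return x + alpha * np.sign(grad)

def raw_gradient_step(x, grad, alpha):
    # A raw-gradient step keeps the gradient's direction and relative
    # magnitudes; the max-norm rescaling here is an illustrative choice
    # that keeps the step size comparable to the sign-based update.
    return x + alpha * grad / (np.max(np.abs(grad)) + 1e-12)

def project(x, x0, eps):
    # Keep the perturbation inside the L-infinity ball of radius eps.
    return np.clip(x, x0 - eps, x0 + eps)
```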
arXiv Detail & Related papers (2023-12-03T02:26:58Z)
- Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
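The summary above names spectral normalization without specifying its form. As a generic sketch under that assumption, the following Python snippet rescales a weight matrix by a power-iteration estimate of its largest singular value, which is the standard way spectral normalization controls a layer's Lipschitz constant; how the method is applied to the model-based policy gradient setting is detailed only in the paper.

```python
# Generic spectral normalization of a weight matrix via power iteration
# (illustrative sketch, not the paper's specific construction).
import numpy as np

def spectral_normalize(W, n_iters=5, eps=1e-12):
    """Rescale W so its largest singular value is approximately 1."""
    u = np.random.default_rng(0).standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    sigma = u @ W @ v  # power-iteration estimate of the top singular value
    return W / (sigma + eps)
```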
arXiv Detail & Related papers (2023-10-30T18:43:21Z)
- Distributionally Robust Optimization with Bias and Variance Reduction [9.341215359733601]
We show that Prospect, a gradient-based algorithm, enjoys linear convergence for smooth regularized losses.
We also show that Prospect can converge 2-3 times faster than gradient-based baselines.
arXiv Detail & Related papers (2023-10-21T00:03:54Z)
- Byzantine-Robust Decentralized Stochastic Optimization with Stochastic Gradient Noise-Independent Learning Error [25.15075119957447]
We study Byzantine-robust optimization over a decentralized network, where every agent periodically communicates with its neighbors to exchange local models, and then updates its own local model by stochastic gradient descent (SGD).
The performance of such a method is affected by an unknown number of Byzantine agents, which behave adversarially during the optimization process.
arXiv Detail & Related papers (2023-08-10T02:14:23Z)
- Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)
- Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)
- GTAdam: Gradient Tracking with Adaptive Momentum for Distributed Online Optimization [4.103281325880475]
This paper deals with a network of computing agents aiming to solve an online optimization problem in a distributed fashion, by means of local computation and communication, without any central coordinator.
We propose the gradient tracking with adaptive momentum estimation (GTAdam) distributed algorithm, which combines a gradient tracking mechanism with first and second order momentum estimates of the gradient.
In numerical experiments on multi-agent learning problems, GTAdam outperforms state-of-the-art distributed optimization methods.
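To make the combination described above concrete, here is a minimal Python sketch of a gradient-tracking update driven by Adam-style first- and second-moment estimates; the quadratic local losses, the uniform mixing matrix, and all step-size constants are illustrative assumptions rather than the GTAdam setting analyzed in the paper.

```python
# Gradient tracking with Adam-style momentum over a small agent network
# (illustrative sketch; bias correction and online/streaming losses omitted).
import numpy as np

n, d = 4, 3                                      # agents, dimension
rng = np.random.default_rng(0)
A = [rng.standard_normal((d, d)) for _ in range(n)]
b = [rng.standard_normal(d) for _ in range(n)]
grad = lambda i, x: A[i].T @ (A[i] @ x - b[i])   # local least-squares gradient

W = np.full((n, n), 1.0 / n)                     # doubly stochastic mixing matrix
alpha, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

x = np.zeros((n, d))
y = np.array([grad(i, x[i]) for i in range(n)])  # gradient trackers
m = np.zeros((n, d))
v = np.zeros((n, d))

for k in range(200):
    # Adam-style moments driven by the tracked gradient estimate y.
    m = beta1 * m + (1 - beta1) * y
    v = beta2 * v + (1 - beta2) * y ** 2
    step = m / (np.sqrt(v) + eps)
    x_new = W @ x - alpha * step                 # consensus + descent step
    g_new = np.array([grad(i, x_new[i]) for i in range(n)])
    g_old = np.array([grad(i, x[i]) for i in range(n)])
    y = W @ y + g_new - g_old                    # gradient tracking update
    x = x_new
```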
arXiv Detail & Related papers (2020-09-03T15:20:21Z)
- Variance Reduction for Deep Q-Learning using Stochastic Recursive Gradient [51.880464915253924]
Deep Q-learning algorithms often suffer from poor gradient estimates with excessive variance.
This paper introduces a stochastic recursive gradient framework for updating the gradient estimates in deep Q-learning, yielding a novel algorithm called SRG-DQN.
arXiv Detail & Related papers (2020-07-25T00:54:20Z)
- Bounding the expected run-time of nonconvex optimization with early stopping [2.7648976108201815]
This work examines the convergence of gradient-based optimization algorithms that use early stopping based on a validation function.
We derive conditions that guarantee this stopping rule is well-defined, and provide bounds on the expected number of iterations and gradient evaluations needed to meet this criterion.
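As a small illustration of the kind of stopping rule discussed above, the following Python sketch runs gradient descent until a validation function falls below a threshold and reports the number of iterations used; the threshold criterion and the toy quadratic are illustrative assumptions, not the paper's exact setting or bounds.

```python
# Gradient descent with a validation-based early-stopping rule
# (illustrative sketch of the general setup).
import numpy as np

def train_with_early_stopping(grad_f, validate, x0, lr=0.1,
                              tol=1e-3, max_iters=10_000):
    """Run gradient descent until the validation criterion is met."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iters):
        if validate(x) <= tol:        # stopping rule on the validation function
            return x, k               # iterations (= gradient evaluations) used
        x = x - lr * grad_f(x)
    return x, max_iters

# Toy usage: minimize a quadratic, stop when a held-out quadratic is small.
grad_f = lambda x: 2 * x
validate = lambda x: float(np.dot(x, x))
x_star, iters = train_with_early_stopping(grad_f, validate, x0=np.ones(5))
print(iters)
```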
arXiv Detail & Related papers (2020-02-20T16:43:37Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over their non-adaptive counterparts in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.