On Faster Convergence of Scaled Sign Gradient Descent
- URL: http://arxiv.org/abs/2109.01806v1
- Date: Sat, 4 Sep 2021 07:26:21 GMT
- Title: On Faster Convergence of Scaled Sign Gradient Descent
- Authors: Xiuxian Li, Kuo-Yi Lin, Li Li, Yiguang Hong, Jie Chen
- Abstract summary: Communication has been seen as a significant bottleneck in industrial applications over large-scale networks.
This paper investigates faster convergence for a variant of sign-based gradient descent methods.
- Score: 9.523120357431383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communication has been seen as a significant bottleneck in industrial
applications over large-scale networks. To alleviate the communication burden,
sign-based optimization algorithms have gained popularity recently in both
industrial and academic communities, which are shown to be closely related to
adaptive gradient methods, such as Adam. Along this line, this paper
investigates faster convergence for a variant of sign-based gradient descent,
called scaled signGD, in three cases: 1) the objective function is strongly
convex; 2) the objective function is nonconvex but satisfies the
Polyak-Lojasiewicz (PL) inequality; 3) the gradient is stochastic (the stochastic
variant of scaled signGD). For the first two cases, it can be shown that the
scaled signGD converges at a linear rate. For case 3), the algorithm is shown
to converge linearly to a neighborhood of the optimal value when a constant
learning rate is employed, and the algorithm converges at a rate of $O(1/k)$
when using a diminishing learning rate, where $k$ is the iteration number. The
results are also extended to the distributed setting by majority vote in a
parameter-server framework. Finally, numerical experiments on logistic
regression are performed to corroborate the theoretical findings.
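To make the update rule concrete, below is a minimal sketch of scaled signGD on logistic regression, together with a coordinate-wise majority-vote aggregation of the kind mentioned for the distributed parameter-server setting. The scaling by the mean absolute gradient, the learning-rate schedules, and the toy data are illustrative assumptions of this sketch, not the paper's exact algorithm or experiments.

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Gradient of the average logistic loss; labels y are in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def majority_vote_sign(grads):
    """Parameter-server aggregation: coordinate-wise majority vote over worker sign vectors."""
    return np.sign(np.sum(np.sign(grads), axis=0))

def scaled_signgd(w0, X, y, lr=0.1, iters=200, diminishing=False):
    w = w0.copy()
    for k in range(1, iters + 1):
        g = logistic_loss_grad(w, X, y)
        scale = np.mean(np.abs(g))              # assumed scaling factor (one common choice)
        step = lr / k if diminishing else lr    # constant vs. O(1/k) learning rate
        w -= step * scale * np.sign(g)
    return w

# toy logistic-regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)
w_hat = scaled_signgd(np.zeros(5), X, y)
```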
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
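As a rough illustration of what a federated AdaGrad step with over-the-air (analog, superposed) aggregation might look like; the noisy-sum channel model and all parameters below are assumptions of this sketch, not the paper's algorithm.

```python
import numpy as np

def over_the_air_adagrad_step(w, client_grads, accum, lr=0.1, eps=1e-8,
                              noise_std=0.01, rng=None):
    """One server-side AdaGrad update from an analog (superposed, noisy) aggregate."""
    rng = rng or np.random.default_rng()
    # Over-the-air aggregation: the server observes the sum of client signals plus channel noise.
    noisy_sum = np.sum(client_grads, axis=0) + noise_std * rng.normal(size=w.shape)
    g = noisy_sum / len(client_grads)
    accum = accum + g ** 2                    # AdaGrad second-moment accumulator
    w = w - lr * g / (np.sqrt(accum) + eps)
    return w, accum
```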
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
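A minimal sketch of the linear-interpolation idea: rather than applying a base optimizer step directly, the iterate is moved only part of the way toward it. The interpolation weight and the toy base step are assumptions for illustration.

```python
import numpy as np

def interpolated_step(w, base_step, lam=0.5):
    """w_next = (1 - lam) * w + lam * T(w), where T is the base optimizer step."""
    return (1.0 - lam) * w + lam * base_step(w)

# example: plain gradient descent on f(w) = ||w||^2 / 2 as the base step T
base = lambda w: w - 0.8 * w
w = np.ones(3)
for _ in range(50):
    w = interpolated_step(w, base)   # moves smoothly toward the minimizer 0
```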
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
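For reference, here is the standard Adam update that the analysis concerns (the variance-reduced variant proposed in the paper is not reproduced here); the hyperparameters are the usual defaults.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step at iteration t (1-indexed) with gradient g."""
    m = b1 * m + (1 - b1) * g            # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```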
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - Convergence of Batch Stochastic Gradient Descent Methods with
Approximate Gradients and/or Noisy Measurements: Theory and Computational
Results [0.9900482274337404]
We study convex optimization using a very general formulation called BSGD (Block Stochastic Gradient Descent).
We establish conditions for BSGD to converge to the global minimum, based on approximation theory.
We show that when approximate gradients are used, BSGD converges while momentum-based methods can diverge.
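A hedged sketch of a BSGD-style step: only a randomly chosen block of coordinates is updated, using an approximate or noisy gradient measurement. The block selection and the additive-noise model are assumptions of this sketch, not the paper's exact setup.

```python
import numpy as np

def bsgd_step(w, grad_fn, lr, block_size, noise_std=0.01, rng=None):
    """Update a random block of coordinates with a noisy gradient measurement."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(w), size=block_size, replace=False)   # random coordinate block
    g = grad_fn(w) + noise_std * rng.normal(size=w.shape)      # approximate/noisy gradient
    w_new = w.copy()
    w_new[idx] -= lr * g[idx]                                  # update only the chosen block
    return w_new
```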
arXiv Detail & Related papers (2022-09-12T16:23:15Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving nonconvex optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
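A sketch of the generic variance-reduced Langevin template (an SVRG-style control variate plus injected Gaussian noise); how the anchor point, step size, and recursion are chosen differs between the two variants and is not captured here.

```python
import numpy as np

def vr_langevin_step(w, w_anchor, full_grad_anchor, stoch_grad, lr, beta, rng):
    """One variance-reduced Langevin step at inverse temperature beta."""
    g = stoch_grad(w) - stoch_grad(w_anchor) + full_grad_anchor   # control-variate estimate
    noise = rng.normal(size=w.shape)
    return w - lr * g + np.sqrt(2.0 * lr / beta) * noise
```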
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Provably Efficient Convergence of Primal-Dual Actor-Critic with
Nonlinear Function Approximation [15.319335698574932]
We show the first efficient convergence result for primal-dual actor-critic with nonlinear function approximation under Markovian sampling.
Results are reported on OpenAI Gym continuous control tasks.
arXiv Detail & Related papers (2022-02-28T15:16:23Z) - A Two-Time-Scale Stochastic Optimization Framework with Applications in
Control and Reinforcement Learning [22.07834608976826]
We study a novel two-time-scale stochastic gradient method for solving problems where the samples are generated from a time-varying distribution.
We show that a convergence rate of $\mathcal{O}(k^{-2/3})$ is achieved. This is the first time such a result is known in the literature.
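A schematic of a two-time-scale update: a fast auxiliary variable tracks a gradient estimate built from correlated samples, while the decision variable moves on a slower time scale. The specific step-size exponents are illustrative assumptions.

```python
import numpy as np

def two_time_scale_step(x, y, sample_grad, k, a0=1.0, b0=1.0):
    """One iteration k >= 0 of a generic two-time-scale scheme."""
    alpha = a0 / (k + 1) ** (2.0 / 3.0)   # slow step size for the decision variable x
    beta = b0 / (k + 1) ** 0.5            # faster step size for the tracking variable y
    y = y + beta * (sample_grad(x) - y)   # track the noisy, correlated gradient
    x = x - alpha * y                     # descend along the tracked estimate
    return x, y
```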
arXiv Detail & Related papers (2021-09-29T23:15:23Z) - Scalable Optimal Transport in High Dimensions for Graph Distances,
Embedding Alignment, and More [7.484063729015126]
We propose two effective log-linear time approximations of the cost matrix for optimal transport.
These approximations enable general log-linear time algorithms for entropy-regularized OT that perform well even in complex, high-dimensional spaces.
For graph distance regression we propose the graph transport network (GTN), which combines graph neural networks (GNNs) with enhanced Sinkhorn.
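For context, here is a compact dense Sinkhorn iteration for entropy-regularized OT; the paper's contribution is a log-linear-time approximation of the cost matrix, which this straightforward quadratic-time version does not include.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, iters=200):
    """Entropy-regularized OT plan between marginals a and b with cost matrix C."""
    K = np.exp(-C / reg)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                 # alternating scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]    # transport plan diag(u) K diag(v)
```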
arXiv Detail & Related papers (2021-07-14T17:40:08Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for nonconvex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
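A hedged sketch of the general idea of correcting a momentum buffer with second-order information: the stale momentum is shifted by a Hessian-vector product so that it refers to the current iterate before the new gradient is mixed in. This is a schematic of the idea, not the paper's exact algorithm.

```python
import numpy as np

def hvp_corrected_momentum_step(x, x_prev, m, grad_fn, hvp_fn, lr=0.01, a=0.1):
    """grad_fn(x) returns a stochastic gradient; hvp_fn(x, v) a Hessian-vector product."""
    m = m + hvp_fn(x_prev, x - x_prev)     # transport stale momentum to the current point
    m = (1 - a) * m + a * grad_fn(x)       # exponential averaging with the fresh gradient
    return x - lr * m, m
```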
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.
The experiments on several benchmark datasets and neural architectures illustrate that the binary network learned using our method achieves the state-of-the-art accuracy.
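A sketch of the frequency-domain idea as the summary describes it: approximate sign(x) with a truncated Fourier (sine) series and use the derivative of that series as a surrogate gradient in the backward pass. The period/normalization below is an assumed convention.

```python
import numpy as np

def sign_fourier(x, n_terms=10, omega=np.pi):
    """Truncated square-wave series: sign(x) ~ (4/pi) * sum_k sin((2k+1) w x) / (2k+1)."""
    k = np.arange(n_terms)[:, None]
    return (4.0 / np.pi) * np.sum(np.sin((2 * k + 1) * omega * x) / (2 * k + 1), axis=0)

def sign_fourier_grad(x, n_terms=10, omega=np.pi):
    """Derivative of the truncated series, used instead of sign's almost-everywhere-zero gradient."""
    k = np.arange(n_terms)[:, None]
    return (4.0 * omega / np.pi) * np.sum(np.cos((2 * k + 1) * omega * x), axis=0)
```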
arXiv Detail & Related papers (2021-03-01T08:25:26Z) - Global Convergence and Variance-Reduced Optimization for a Class of
Nonconvex-Nonconcave Minimax Problems [39.13924898972611]
Nonconvex minimax problems appear frequently in emerging machine learning applications, such as generative adversarial networks and adversarial learning.
GDA algorithms with a constant step size can potentially diverge even in the convex setting.
The AGDA algorithm is shown to converge globally at a sublinear rate.
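A minimal alternating gradient descent-ascent (AGDA) step for min over x, max over y of f(x, y): x is updated first and y then uses the fresh x; the step sizes are illustrative.

```python
import numpy as np

def agda_step(x, y, grad_x, grad_y, lr_x=0.05, lr_y=0.05):
    """grad_x(x, y) and grad_y(x, y) return the partial gradients of f."""
    x_new = x - lr_x * grad_x(x, y)        # descent step on the min variable
    y_new = y + lr_y * grad_y(x_new, y)    # ascent step on the max variable, using updated x
    return x_new, y_new
```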
arXiv Detail & Related papers (2020-02-22T04:20:37Z)