A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to
Balance Communication Overhead, Computational Complexity, and Convergence
Rate
- URL: http://arxiv.org/abs/2003.12423v1
- Date: Fri, 27 Mar 2020 14:02:15 GMT
- Title: A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to
Balance Communication Overhead, Computational Complexity, and Convergence
Rate
- Authors: Naeimeh Omidvar, Mohammad Ali Maddah-Ali, Hamed Mahdavi
- Abstract summary: We propose a method of distributed stochastic gradient descent (SGD) with low communication load and computational complexity, and still fast convergence.
To reduce the computational complexity in each iteration, the worker nodes approximate the directional derivatives with zeroth-order stochastic gradient estimation.
- Score: 28.167294398293297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a method of distributed stochastic gradient descent
(SGD) with low communication load and computational complexity, and still fast
convergence. To reduce the communication load, at each iteration of the
algorithm, the worker nodes calculate and communicate some scalars, which are
the directional derivatives of the sample functions along some \emph{pre-shared
directions}. However, to maintain accuracy, after a fixed number of iterations,
they communicate the full vectors of stochastic gradients. To reduce the
computational complexity in each iteration, the worker nodes approximate the
directional derivatives with zeroth-order stochastic gradient estimation, by
performing just two function evaluations rather than computing a first-order
gradient vector. The proposed method significantly improves the convergence rate
of zeroth-order methods, guaranteeing order-wise faster convergence. Moreover,
compared to the well-known communication-efficient methods of model averaging
(which perform local model updates and periodic communication of the gradients
to synchronize the local models), we prove that for the general class of
non-convex stochastic problems and with a reasonable choice of parameters, the
proposed method guarantees the same orders of communication load and
convergence rate, while having order-wise lower computational complexity.
Experimental results on various learning problems in neural network
applications demonstrate the effectiveness of the proposed approach compared to
various state-of-the-art distributed SGD methods.
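To make the hybrid scheme described in the abstract concrete, here is a minimal single-worker Python sketch. It is not the authors' implementation: the noisy least-squares toy objective and all names and constants (sample_loss, sync_period, mu, the learning rate, the choice of three pre-shared directions) are illustrative assumptions. Only the overall pattern follows the abstract: two-point directional-derivative estimates are communicated as scalars at most iterations, and a full stochastic-gradient vector is communicated every sync_period iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10
A = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # toy problem data
target = rng.standard_normal(dim)

def sample_loss(x, seed):
    """Stochastic sample loss f(x; xi): a noisy least-squares toy objective."""
    noise = np.random.default_rng(seed).standard_normal(dim) * 0.1
    r = A @ x - target + noise
    return 0.5 * (r @ r)

def zo_directional_derivative(x, u, seed, mu=1e-4):
    """Two-point zeroth-order estimate of the directional derivative of the
    sample function along u; costs just two function evaluations."""
    return (sample_loss(x + mu * u, seed) - sample_loss(x - mu * u, seed)) / (2 * mu)

def worker_message(x, step, pre_shared_dirs, sync_period=10):
    """Cheap iterations: send only scalars (directional derivatives along the
    pre-shared directions). Every sync_period iterations: send a full
    stochastic-gradient vector (approximated here by coordinate-wise two-point
    estimates purely for brevity)."""
    seed = 1000 + step
    if step % sync_period == 0:
        full = np.array([zo_directional_derivative(x, e, seed) for e in np.eye(x.size)])
        return "gradient", full
    scalars = np.array([zo_directional_derivative(x, u, seed) for u in pre_shared_dirs])
    return "scalars", scalars

def server_update(x, message, pre_shared_dirs, lr=0.05):
    """Apply either the full gradient or a direction rebuilt from the scalars."""
    kind, payload = message
    if kind == "gradient":
        return x - lr * payload
    return x - lr * sum(s * u for s, u in zip(payload, pre_shared_dirs))

x = np.zeros(dim)
pre_shared_dirs = [d / np.linalg.norm(d) for d in rng.standard_normal((3, dim))]
for t in range(1, 201):
    x = server_update(x, worker_message(x, t, pre_shared_dirs), pre_shared_dirs)
print("final sample loss:", sample_loss(x, seed=0))
```

In this toy setup the cheap rounds only move the model within the span of the pre-shared directions, which is why the periodic full-gradient rounds are needed to retain accuracy, mirroring the trade-off the abstract describes.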
Related papers
- Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient [40.22217106270146]
Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods.
In composite optimization problems, ZO methods encounter an additional variance, called the coordinate-wise variance, which stems from random gradient estimation.
This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances.
arXiv Detail & Related papers (2024-05-28T02:27:53Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods [75.34939761152587]
Efficient computation of the optimal transport distance between two distributions serves as an algorithmic subroutine that empowers various applications.
This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy.
arXiv Detail & Related papers (2023-01-30T15:46:39Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - A Stochastic Newton Algorithm for Distributed Convex Optimization [62.20732134991661]
We analyze a stochastic Newton algorithm for homogeneous distributed convex optimization, where each machine can calculate stochastic gradients of the same population objective.
We show that our method can reduce the number and frequency of required communication rounds compared to existing methods without hurting performance.
arXiv Detail & Related papers (2021-10-07T17:51:10Z) - On Accelerating Distributed Convex Optimizations [0.0]
This paper studies a distributed multi-agent convex optimization problem.
We show that the proposed algorithm converges linearly, with an improved rate of convergence compared to the traditional and adaptive gradient-descent methods.
We demonstrate our algorithm's superior performance compared to prominent distributed algorithms on real-world logistic regression problems.
arXiv Detail & Related papers (2021-08-19T13:19:54Z) - Asynchronous Distributed Reinforcement Learning for LQR Control via Zeroth-Order Block Coordinate Descent [7.6860514640178]
We propose a novel zeroth-order optimization algorithm for distributed reinforcement learning.
It allows each agent to estimate its local gradient independently via cost evaluations, without the use of any consensus protocol.
arXiv Detail & Related papers (2021-07-26T18:11:07Z) - Iterative Pre-Conditioning for Expediting the Gradient-Descent Method:
The Distributed Linear Least-Squares Problem [0.966840768820136]
This paper considers the multi-agent linear least-squares problem in a server-agent network.
The goal for the agents is to compute a linear model that optimally fits the collective data points held by all the agents, without sharing their individual local data points.
We propose an iterative pre-conditioning technique that mitigates the deleterious effect of the conditioning of data points on the rate of convergence of the gradient-descent method.
arXiv Detail & Related papers (2020-08-06T20:01:18Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve for one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)