Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD
for Communication Efficient Nonconvex Distributed Learning
- URL: http://arxiv.org/abs/2202.06083v1
- Date: Sat, 12 Feb 2022 15:12:17 GMT
- Title: Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD
for Communication Efficient Nonconvex Distributed Learning
- Authors: Tomoya Murata and Taiji Suzuki
- Abstract summary: Local methods are one of the promising approaches to reduce communication time.
We show that the communication complexity is better than that of non-local methods when the local dataset heterogeneity is smaller than the smoothness of the local loss.
- Score: 58.79085525115987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent centralized nonconvex distributed learning and federated learning,
local methods are one of the promising approaches to reduce communication time.
However, existing work has mainly focused on first-order optimality
guarantees. On the other hand, algorithms with second-order optimality
guarantees have been extensively studied in the non-distributed optimization literature.
In this paper, we study a new local algorithm called Bias-Variance Reduced
Local Perturbed SGD (BVR-L-PSGD), which combines the existing bias-variance
reduced gradient estimator with parameter perturbation to find second-order
optimal points in centralized nonconvex distributed optimization. BVR-L-PSGD
enjoys second-order optimality with nearly the same communication complexity as
the best known communication complexity of BVR-L-SGD for finding first-order
optimal points. In particular, the communication complexity is better than that
of non-local methods when the local dataset heterogeneity is smaller than the
smoothness of the local loss. In the extreme case, the communication complexity
approaches $\widetilde\Theta(1)$ as the local dataset heterogeneity goes to zero.
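The abstract gives no pseudocode, so the following is only a minimal, illustrative Python sketch of the general idea it describes: a SARAH/SVRG-style bias-variance reduced gradient estimator run locally on each worker, a small isotropic perturbation applied when the synchronized gradient is small, and periodic averaging of the local iterates. The toy losses, step size, perturbation radius, and schedule are assumptions made for illustration, not the exact BVR-L-PSGD procedure of Murata and Suzuki.

```python
# Illustrative sketch only (assumptions: toy indefinite-quadratic local losses,
# fixed step size, fixed perturbation radius). Not the exact BVR-L-PSGD method.
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, n_local = 10, 4, 100

def make_worker(seed):
    # Each worker holds f_i(x) = (1/n_local) * sum_j 0.5 * x^T H_ij x, where the
    # H_ij share one negative-curvature direction, so x = 0 is a strict saddle.
    r = np.random.default_rng(seed)
    base = np.diag(np.concatenate([np.ones(d - 1), [-0.5]]))
    Hs = [base + 0.05 * r.standard_normal((d, d)) for _ in range(n_local)]
    return [0.5 * (H + H.T) for H in Hs]  # symmetrize

workers = [make_worker(s) for s in range(n_workers)]

def stoch_grad(Hs, x, batch):          # minibatch gradient of the local loss
    return np.mean([Hs[j] @ x for j in batch], axis=0)

def full_grad(Hs, x):                  # full local gradient (variance-reduction anchor)
    return np.mean([H @ x for H in Hs], axis=0)

eta, radius, local_steps, rounds, batch_size = 0.05, 1e-3, 20, 15, 10
x = np.zeros(d)                        # start exactly at the saddle point

for _ in range(rounds):
    # Communication: synchronize the anchor point and its exact global gradient.
    g = np.mean([full_grad(Hs, x) for Hs in workers], axis=0)
    if np.linalg.norm(g) < 1e-4:       # gradient small -> possibly a saddle:
        x = x + radius * rng.standard_normal(d)   # add a small random perturbation
        g = np.mean([full_grad(Hs, x) for Hs in workers], axis=0)
    anchor = x.copy()
    local_iterates = []
    for Hs in workers:
        xi, vi, prev = anchor.copy(), g.copy(), anchor.copy()
        for _ in range(local_steps):   # SARAH-style recursive bias-variance reduced estimator
            batch = rng.choice(n_local, size=batch_size, replace=False)
            vi = vi + stoch_grad(Hs, xi, batch) - stoch_grad(Hs, prev, batch)
            prev = xi.copy()
            xi = xi - eta * vi
        local_iterates.append(xi)
    x = np.mean(local_iterates, axis=0)  # communication: average local iterates

final_obj = np.mean([0.5 * x @ H @ x for Hs in workers for H in Hs])
print("objective after BVR-style local perturbed SGD:", final_obj)
```

In this toy example the objective is unbounded below along the negative-curvature direction, so a strictly negative final value indicates that the perturbation let the iterates escape the saddle at the origin; with the perturbation line removed, every gradient is exactly zero and the iterates never move.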
Related papers
- The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice (a minimal sketch of this baseline appears after the list below).
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
arXiv Detail & Related papers (2024-05-19T20:20:03Z) - Federated Minimax Optimization: Improved Convergence Analyses and
Algorithms [32.062312674333775]
We consider nonconvex minimax optimization, which is gaining prominence in many modern machine learning applications such as GANs.
We provide a novel and tighter analysis of an existing algorithm, improving the convergence and communication guarantees in the existing literature.
arXiv Detail & Related papers (2022-03-09T16:21:31Z) - The Minimax Complexity of Distributed Optimization [0.0]
I present the "graph oracle model", an extension of the classic oracle framework that can be applied to distributed optimization.
I focus on the specific case of the "intermittent communication setting".
I analyze the theoretical properties of the popular Local Stochastic Gradient Descent (SGD) algorithm in the convex setting.
arXiv Detail & Related papers (2021-09-01T15:18:33Z) - Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Minimax
Problems [80.46370778277186]
Large scale convex-concave minimax problems arise in numerous applications, including game theory, robust training, and training of generative adversarial networks.
We develop a communication-efficient distributed extragradient algorithm, LocalAdaSient, with an adaptive learning rate suitable for solving convex-concave minimax problems in the Parameter-Server model.
We demonstrate its efficacy through several experiments in both the homogeneous and heterogeneous settings.
arXiv Detail & Related papers (2021-06-18T09:42:05Z) - Local Stochastic Gradient Descent Ascent: Convergence Analysis and
Communication Efficiency [15.04034188283642]
Local SGD is a promising approach to overcome the communication overhead in distributed learning.
We show that local SGDA can provably optimize distributed minimax problems in both the homogeneous and heterogeneous data settings.
arXiv Detail & Related papers (2021-02-25T20:15:18Z) - Bias-Variance Reduced Local SGD for Less Heterogeneous Federated
Learning [46.32232395989181]
We aim at learning efficiently with local methods, in terms of both communication and computational complexity.
One of the important learning scenarios in distributed learning is the Federated Learning scenario.
arXiv Detail & Related papers (2021-02-05T14:32:28Z) - FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity
to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z) - Second-Order Guarantees in Centralized, Federated and Decentralized
Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying nonconvex optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms to escape saddle points.
arXiv Detail & Related papers (2020-03-31T16:54:22Z) - A Unified Theory of Decentralized SGD with Changing Topology and Local
Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)
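Several of the related papers above build on plain Local SGD with intermittent communication (referenced in the first entry of the list). Below is a minimal illustrative sketch of that baseline: each worker runs a few local SGD steps on its own data and the server periodically averages the iterates. The toy least-squares data, step size, and averaging period are hypothetical choices made only for illustration and are not taken from any of the papers above.

```python
# Illustrative Local SGD sketch (assumptions: toy heterogeneous least-squares
# losses, fixed step size and averaging period).
import numpy as np

rng = np.random.default_rng(1)
d, n_workers, n_local = 5, 4, 200

# Heterogeneous local datasets: each worker regresses toward a slightly different optimum.
data = []
for i in range(n_workers):
    A = rng.standard_normal((n_local, d))
    x_star = rng.standard_normal(d) * (1.0 + 0.1 * i)
    b = A @ x_star + 0.01 * rng.standard_normal(n_local)
    data.append((A, b))

def minibatch_grad(A, b, x, idx):
    A_b, b_b = A[idx], b[idx]
    return A_b.T @ (A_b @ x - b_b) / len(idx)

eta, local_steps, rounds, batch = 0.01, 10, 50, 16
x = np.zeros(d)

for _ in range(rounds):                      # one round = one communication
    local_iterates = []
    for A, b in data:
        xi = x.copy()
        for _ in range(local_steps):         # local updates, no communication
            idx = rng.choice(n_local, size=batch, replace=False)
            xi = xi - eta * minibatch_grad(A, b, xi, idx)
        local_iterates.append(xi)
    x = np.mean(local_iterates, axis=0)      # intermittent communication: average

avg_loss = np.mean([0.5 * np.mean((A @ x - b) ** 2) for A, b in data])
print("average local loss after Local SGD:", avg_loss)
```

Increasing `local_steps` reduces the number of communication rounds needed for a given amount of computation, but, as the lower bounds in the first related paper suggest, the benefit degrades as the per-worker optima (here, the `x_star` values) become more heterogeneous.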