Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD
for Communication Efficient Nonconvex Distributed Learning
- URL: http://arxiv.org/abs/2202.06083v1
- Date: Sat, 12 Feb 2022 15:12:17 GMT
- Title: Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD
for Communication Efficient Nonconvex Distributed Learning
- Authors: Tomoya Murata and Taiji Suzuki
- Abstract summary: Local methods are one of the promising approaches to reduce communication time.
We show that the communication complexity is better than that of non-local methods when the local dataset heterogeneity is smaller than the smoothness of the local loss.
- Score: 58.79085525115987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent centralized nonconvex distributed learning and federated learning,
local methods are one of the promising approaches to reduce communication time.
However, existing work has mainly focused on first-order optimality
guarantees. On the other hand, algorithms with second-order optimality
guarantees have been extensively studied in the non-distributed optimization literature.
In this paper, we study a new local algorithm called Bias-Variance Reduced
Local Perturbed SGD (BVR-L-PSGD), which combines the existing bias-variance
reduced gradient estimator with parameter perturbation to find second-order
optimal points in centralized nonconvex distributed optimization. BVR-L-PSGD
enjoys second-order optimality with nearly the same communication complexity as
the best known communication complexity of BVR-L-SGD for finding first-order
optimal points. In particular, the communication complexity is better than that
of non-local methods when the local dataset heterogeneity is smaller than the
smoothness of the local loss. In the extreme case, the communication complexity
approaches $\widetilde\Theta(1)$ as the local dataset heterogeneity goes to zero.
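The abstract gives no pseudocode, so the following is only a minimal, illustrative Python sketch of the general idea it describes: a SARAH/SVRG-style bias-variance reduced gradient estimator run locally on each worker, a small isotropic perturbation applied when the synchronized gradient is small, and periodic averaging of the local iterates. The toy losses, step size, perturbation radius, and schedule are assumptions made for illustration, not the exact BVR-L-PSGD procedure of Murata and Suzuki.

```python
# Illustrative sketch only (assumptions: toy indefinite-quadratic local losses,
# fixed step size, fixed perturbation radius). Not the exact BVR-L-PSGD method.
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, n_local = 10, 4, 100

def make_worker(seed):
    # Each worker holds f_i(x) = (1/n_local) * sum_j 0.5 * x^T H_ij x, where the
    # H_ij share one negative-curvature direction, so x = 0 is a strict saddle.
    r = np.random.default_rng(seed)
    base = np.diag(np.concatenate([np.ones(d - 1), [-0.5]]))
    Hs = [base + 0.05 * r.standard_normal((d, d)) for _ in range(n_local)]
    return [0.5 * (H + H.T) for H in Hs]  # symmetrize

workers = [make_worker(s) for s in range(n_workers)]

def stoch_grad(Hs, x, batch):          # minibatch gradient of the local loss
    return np.mean([Hs[j] @ x for j in batch], axis=0)

def full_grad(Hs, x):                  # full local gradient (variance-reduction anchor)
    return np.mean([H @ x for H in Hs], axis=0)

eta, radius, local_steps, rounds, batch_size = 0.05, 1e-3, 20, 15, 10
x = np.zeros(d)                        # start exactly at the saddle point

for _ in range(rounds):
    # Communication: synchronize the anchor point and its exact global gradient.
    g = np.mean([full_grad(Hs, x) for Hs in workers], axis=0)
    if np.linalg.norm(g) < 1e-4:       # gradient small -> possibly a saddle:
        x = x + radius * rng.standard_normal(d)   # add a small random perturbation
        g = np.mean([full_grad(Hs, x) for Hs in workers], axis=0)
    anchor = x.copy()
    local_iterates = []
    for Hs in workers:
        xi, vi, prev = anchor.copy(), g.copy(), anchor.copy()
        for _ in range(local_steps):   # SARAH-style recursive bias-variance reduced estimator
            batch = rng.choice(n_local, size=batch_size, replace=False)
            vi = vi + stoch_grad(Hs, xi, batch) - stoch_grad(Hs, prev, batch)
            prev = xi.copy()
            xi = xi - eta * vi
        local_iterates.append(xi)
    x = np.mean(local_iterates, axis=0)  # communication: average local iterates

final_obj = np.mean([0.5 * x @ H @ x for Hs in workers for H in Hs])
print("objective after BVR-style local perturbed SGD:", final_obj)
```

In this toy example the objective is unbounded below along the negative-curvature direction, so a strictly negative final value indicates that the perturbation let the iterates escape the saddle at the origin; with the perturbation line removed, every gradient is exactly zero and the iterates never move.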
Related papers
- The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication [37.210933391984014]
Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice (a minimal sketch of this baseline appears after the list below).
We provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions.
We also show the min-max optimality of accelerated mini-batch SGD for several problem classes.
arXiv Detail & Related papers (2024-05-19T20:20:03Z) - Federated Minimax Optimization: Improved Convergence Analyses and
Algorithms [32.062312674333775]
We consider nonconvex minimax optimization, which is gaining prominence in many modern machine learning applications such as GANs.
We provide a novel and tighter analysis of an existing algorithm, improving the convergence and communication guarantees in the existing literature.
arXiv Detail & Related papers (2022-03-09T16:21:31Z) - The Minimax Complexity of Distributed Optimization [0.0]
I present the "graph oracle model", an extension of the classic oracle framework that can be applied to distributed optimization.
I focus on the specific case of the "intermittent communication setting".
I analyze the theoretical properties of the popular Local Stochastic Gradient Descent (SGD) algorithm in the convex setting.
arXiv Detail & Related papers (2021-09-01T15:18:33Z) - Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Minimax
Problems [80.46370778277186]
Large scale convex-concave minimax problems arise in numerous applications, including game theory, robust training, and training of generative adversarial networks.
We develop a communication-efficient distributed extragradient algorithm, LocalAdaSient, with an adaptive learning rate suitable for solving convex-concave minimax problems in the Parameter-Server model.
We demonstrate its efficacy through several experiments in both the homogeneous and heterogeneous settings.
arXiv Detail & Related papers (2021-06-18T09:42:05Z) - Local Stochastic Gradient Descent Ascent: Convergence Analysis and
Communication Efficiency [15.04034188283642]
Local SGD is a promising approach to overcome the communication overhead in distributed learning.
We show that local SGDA can provably optimize distributed minimax problems in both the homogeneous and heterogeneous data settings.
arXiv Detail & Related papers (2021-02-25T20:15:18Z) - Bias-Variance Reduced Local SGD for Less Heterogeneous Federated
Learning [46.32232395989181]
We aim at learning efficiently with local methods, in terms of both communication and computational complexity.
One of the important learning scenarios in distributed learning is the Federated Learning scenario.
arXiv Detail & Related papers (2021-02-05T14:32:28Z) - FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity
to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z) - Second-Order Guarantees in Centralized, Federated and Decentralized
Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying nonconvex optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms to escape saddle points.
arXiv Detail & Related papers (2020-03-31T16:54:22Z) - A Unified Theory of Decentralized SGD with Changing Topology and Local
Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)
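Several of the related papers above build on plain Local SGD with intermittent communication (referenced in the first entry of the list). Below is a minimal illustrative sketch of that baseline: each worker runs a few local SGD steps on its own data and the server periodically averages the iterates. The toy least-squares data, step size, and averaging period are hypothetical choices made only for illustration and are not taken from any of the papers above.

```python
# Illustrative Local SGD sketch (assumptions: toy heterogeneous least-squares
# losses, fixed step size and averaging period).
import numpy as np

rng = np.random.default_rng(1)
d, n_workers, n_local = 5, 4, 200

# Heterogeneous local datasets: each worker regresses toward a slightly different optimum.
data = []
for i in range(n_workers):
    A = rng.standard_normal((n_local, d))
    x_star = rng.standard_normal(d) * (1.0 + 0.1 * i)
    b = A @ x_star + 0.01 * rng.standard_normal(n_local)
    data.append((A, b))

def minibatch_grad(A, b, x, idx):
    A_b, b_b = A[idx], b[idx]
    return A_b.T @ (A_b @ x - b_b) / len(idx)

eta, local_steps, rounds, batch = 0.01, 10, 50, 16
x = np.zeros(d)

for _ in range(rounds):                      # one round = one communication
    local_iterates = []
    for A, b in data:
        xi = x.copy()
        for _ in range(local_steps):         # local updates, no communication
            idx = rng.choice(n_local, size=batch, replace=False)
            xi = xi - eta * minibatch_grad(A, b, xi, idx)
        local_iterates.append(xi)
    x = np.mean(local_iterates, axis=0)      # intermittent communication: average

avg_loss = np.mean([0.5 * np.mean((A @ x - b) ** 2) for A, b in data])
print("average local loss after Local SGD:", avg_loss)
```

Increasing `local_steps` reduces the number of communication rounds needed for a given amount of computation, but, as the lower bounds in the first related paper suggest, the benefit degrades as the per-worker optima (here, the `x_star` values) become more heterogeneous.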