Meta-learning Optimizers for Communication-Efficient Learning
- URL: http://arxiv.org/abs/2312.02204v2
- Date: Wed, 11 Jun 2025 23:03:11 GMT
- Title: Meta-learning Optimizers for Communication-Efficient Learning
- Authors: Charles-Étienne Joseph, Benjamin Thérien, Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky,
- Abstract summary: Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. Many variants of these approaches have been proposed, but they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning.
- Score: 12.640586942181322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning. In this work, we investigate if the recent progress in the emerging area of learned optimizers can potentially close this gap in homogeneous data and homogeneous device settings while remaining communication-efficient. Specifically, we meta-learn how to perform global updates given an update from local SGD iterations. Our results demonstrate that learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. Our learned optimizers can even generalize to unseen and much larger datasets and architectures, including ImageNet and ViTs, and to unseen modalities such as language modeling. We therefore show the potential of learned optimizers for improving communication-efficient distributed learning.
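The following is a minimal, self-contained sketch (toy quadratic objectives, NumPy only) of the setup the abstract describes: each worker takes several local SGD steps without communicating, only the averaged parameter delta is exchanged, and the server applies a global update computed from that delta. The gain-plus-momentum rule standing in for the meta-learned update network is an illustrative assumption, not the paper's learned optimizer.

```python
# Toy sketch: local SGD with a learned-style global update (hypothetical names).
# Plain local SGD would average worker parameters; here the server applies a
# simple stand-in rule to the averaged delta, in place of a meta-learned network.
import numpy as np

rng = np.random.default_rng(0)

def make_worker():
    # Each worker holds a toy quadratic objective f_k(w) = 0.5 * ||A_k w - b_k||^2
    A = rng.normal(size=(16, 8))
    b = rng.normal(size=16)
    return A, b

def grad(A, b, w):
    return A.T @ (A @ w - b)

workers = [make_worker() for _ in range(4)]
w_global = np.zeros(8)

# Stand-in for the meta-learned global update: a fixed gain plus momentum.
gain, momentum, velocity = 1.5, 0.9, np.zeros(8)

for _ in range(50):                        # communication rounds
    deltas = []
    for A, b in workers:
        w = w_global.copy()
        for _ in range(8):                 # H local SGD steps, no communication
            w -= 0.01 * grad(A, b, w)
        deltas.append(w - w_global)        # local pseudo-gradient
    avg_delta = np.mean(deltas, axis=0)    # the only communicated quantity

    # Global update computed from the averaged local delta.
    velocity = momentum * velocity + avg_delta
    w_global += gain * velocity

loss = sum(0.5 * np.sum((A @ w_global - b) ** 2) for A, b in workers)
print(f"final summed loss: {loss:.3f}")
```

Communication cost per round is a single vector of model size, exactly as in local SGD; only the rule that turns the averaged delta into a global step changes.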
Related papers
- Optimizers Qualitatively Alter Solutions And We Should Leverage This [62.662640460717476]
Deep Neural Networks (DNNs) cannot be guaranteed to converge to a unique global minimum of the loss when trained using only local information, as in SGD. We argue that the community should aim to understand the biases of already existing methods, and to build new DNNs with the explicit intent of inducing certain properties of the solution.
arXiv Detail & Related papers (2025-07-16T13:33:31Z) - Efficient Distributed Optimization under Heavy-Tailed Noise [32.96984712007111]
TailOPT is designed to address heavy-tailed noise with potentially unbounded gradient variance and local updates.
Bi2Clip performs coordinate-wise clipping at both the inner and outer optimizers, achieving adaptive-like performance; a toy sketch of this inner/outer coordinate-wise clipping appears after the related-papers list below.
Bi2Clip demonstrates superior performance on several language tasks and models, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-02-06T15:47:18Z) - GDSG: Graph Diffusion-based Solution Generator for Optimization Problems in MEC Networks [109.17835015018532]
We present a Graph Diffusion-based Solution Generation (GDSG) method.
This approach is designed to work with suboptimal datasets while converging to the optimal solution with high probability.
We build GDSG as a multi-task diffusion model utilizing a Graph Neural Network (GNN) to acquire the distribution of high-quality solutions.
arXiv Detail & Related papers (2024-12-11T11:13:43Z) - High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions to these problems, enabling us to learn a communication-efficient distributed logistic regression model.
In our experiments, we demonstrate a large improvement in accuracy over distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z) - Context-Aware Orchestration of Energy-Efficient Gossip Learning Schemes [8.382766344930157]
We present a distributed training approach based on the combination of Gossip Learning with adaptive optimization of the learning process.
We propose a data-driven approach to OGL management that relies on optimizing the learning process in real time for each node.
Results suggest that our approach is highly efficient and effective in a broad spectrum of network scenarios.
arXiv Detail & Related papers (2024-04-18T09:17:46Z) - FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specific auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Training With Data Dependent Dynamic Learning Rates [8.833548357664608]
We propose an optimization framework which accounts for difference in loss function characteristics across instances.
Our framework learns a dynamic learning rate for each instance present in the dataset.
We show that our framework can be used for personalization of a machine learning model towards a known targeted data distribution.
arXiv Detail & Related papers (2021-05-27T21:52:29Z) - CosSGD: Nonlinear Quantization for Communication-efficient Federated Learning [62.65937719264881]
Federated learning facilitates learning across clients without transferring local data on these clients to a central server.
We propose a nonlinear quantization for compressed gradient descent, which can be easily utilized in federated learning.
Our system significantly reduces the communication cost by up to three orders of magnitude, while maintaining convergence and accuracy of the training process.
arXiv Detail & Related papers (2020-12-15T12:20:28Z) - Domain Adaptive Person Re-Identification via Coupling Optimization [58.567492812339566]
Domain adaptive person Re-Identification (ReID) is challenging owing to the domain gap and shortage of annotations on target scenarios.
This paper proposes a coupling optimization method including the Domain-Invariant Mapping (DIM) method and the Global-Local distance Optimization (GLO).
GLO is designed to train the ReID model in an unsupervised setting on the target domain.
arXiv Detail & Related papers (2020-11-06T14:01:03Z) - Jointly Optimizing Dataset Size and Local Updates in Heterogeneous Mobile Edge Learning [11.191719032853527]
This paper proposes to maximize the accuracy of a distributed machine learning (ML) model trained on learners connected via the resource-constrained wireless edge.
We jointly optimize the number of local/global updates and the task size allocation to minimize the loss while taking into account heterogeneous communication and computation capabilities of each learner.
arXiv Detail & Related papers (2020-06-12T18:19:20Z)
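As referenced in the TailOPT/Bi2Clip entry above, here is a minimal sketch of coordinate-wise clipping applied at both the inner (worker) and outer (server) updates. The toy objective, noise model, and threshold values are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch: coordinate-wise clipping at both the inner (local) and
# outer (aggregate) steps, in the spirit of the Bi2Clip entry above.
import numpy as np

rng = np.random.default_rng(1)

def clip_coordinatewise(v, tau):
    """Clamp every coordinate of v into [-tau, tau]."""
    return np.clip(v, -tau, tau)

def noisy_grad(w):
    # Gradient of 0.5 * ||w - 1||^2 corrupted by heavy-tailed Student-t noise.
    return w - 1.0 + rng.standard_t(df=1.5, size=w.shape)

w_server = np.zeros(4)
tau_inner, tau_outer = 1.0, 0.5        # illustrative clipping thresholds

for _ in range(200):                   # communication rounds
    deltas = []
    for _ in range(8):                 # 8 workers
        w = w_server.copy()
        for _ in range(4):             # inner steps with coordinate-wise clipping
            w -= 0.1 * clip_coordinatewise(noisy_grad(w), tau_inner)
        deltas.append(w - w_server)
    # Outer step: clip the averaged pseudo-gradient coordinate-wise as well.
    w_server += clip_coordinatewise(np.mean(deltas, axis=0), tau_outer)

print("server iterate:", np.round(w_server, 2))  # should land near the optimum at 1
```

Clipping each coordinate separately (rather than the whole vector norm) gives the update an adaptive, per-coordinate scale while remaining robust to heavy-tailed gradient noise.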