Lion Cub: Minimizing Communication Overhead in Distributed Lion
- URL: http://arxiv.org/abs/2411.16462v1
- Date: Mon, 25 Nov 2024 15:08:24 GMT
- Title: Lion Cub: Minimizing Communication Overhead in Distributed Lion
- Authors: Satoki Ishikawa, Tal Ben-Nun, Brian Van Essen, Rio Yokota, Nikoli Dryden
- Abstract summary: Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects.
We analyze three factors critical to distributed learning with Lion: optimizing communication methods, identifying effective quantization methods, and assessing the necessity of momentum synchronization.
We combine these into Lion Cub, which enables up to 5x speedups in end-to-end training compared to Lion.
- Score: 9.360174471655977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects, and given current hardware trends, communication is likely to become a major bottleneck. While gradient compression techniques have been explored for SGD and Adam, the Lion optimizer has the distinct advantage that its update vectors are the output of a sign operation, enabling straightforward quantization. However, simply compressing updates for communication and using techniques like majority voting fails to lead to end-to-end speedups due to inefficient communication algorithms and reduced convergence. We analyze three factors critical to distributed learning with Lion: optimizing communication methods, identifying effective quantization methods, and assessing the necessity of momentum synchronization. Our findings show that quantization techniques adapted to Lion and selective momentum synchronization can significantly reduce communication costs while maintaining convergence. We combine these into Lion Cub, which enables up to 5x speedups in end-to-end training compared to Lion. This highlights Lion's potential as a communication-efficient solution for distributed training.
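The abstract's central observation is that Lion's applied update is the element-wise sign of an interpolation between momentum and gradient, so each coordinate carries roughly one bit. The sketch below shows a single-worker Lion step and a naive 1-bit packing of its update; it is a minimal illustration of why sign updates compress well, under my own naming and defaults, and is not the Lion Cub communication scheme itself.

```python
import numpy as np
import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One single-worker Lion step: the applied update is an element-wise sign vector."""
    update = torch.sign(beta1 * momentum + (1 - beta1) * grad)  # entries in {-1, 0, +1}
    param -= lr * (update + wd * param)                         # decoupled weight decay
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)            # momentum tracks past gradients
    return update

def pack_signs(update):
    """Encode a sign update at ~1 bit per element (zeros are sent as +1 here, a lossy choice)."""
    bits = (update >= 0).to(torch.uint8).numpy()
    return np.packbits(bits)

torch.manual_seed(0)
param, grad, momentum = torch.randn(16), torch.randn(16), torch.zeros(16)
update = lion_step(param, grad, momentum)
print(update)
print(pack_signs(update))   # 16 elements -> 2 bytes on the wire instead of 64 bytes of fp32
```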
Related papers
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference [14.805702987440512]
We introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference.
Our method substantially boosts intra-node communication speed by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy.
arXiv Detail & Related papers (2024-12-06T11:29:32Z)
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop methods that enable learning a communication-efficient distributed logistic regression model.
Our experiments demonstrate a large improvement in accuracy over existing distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- Sparse-ProxSkip: Accelerated Sparse-to-Sparse Training in Federated Learning [56.21666819468249]
In Federated Learning (FL), both client resource constraints and communication costs pose major problems for training large models.
Recent work has shown that local training provably improves communication complexity through acceleration.
We introduce Sparse-ProxSkip, which addresses these issues by incorporating the efficient Straight-Through Estimator pruning technique into sparse training.
arXiv Detail & Related papers (2024-05-31T05:21:12Z)
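The summary above refers to Straight-Through Estimator (STE) pruning. A common way to realize it, sketched below under my own naming, is to apply a binary magnitude mask in the forward pass while letting gradients flow to the dense weights as if the mask were not there; this is a generic STE sketch, not Sparse-ProxSkip itself.

```python
import torch

class PruneSTE(torch.autograd.Function):
    """Magnitude pruning with a straight-through gradient estimator."""
    @staticmethod
    def forward(ctx, weight, sparsity):
        k = int(sparsity * weight.numel())                   # number of entries to zero out
        if k == 0:
            return weight
        threshold = weight.abs().flatten().kthvalue(k).values
        mask = (weight.abs() > threshold).to(weight.dtype)   # keep only large-magnitude weights
        return weight * mask

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                             # gradient passes straight through

w = torch.randn(4, 4, requires_grad=True)
sparse_w = PruneSTE.apply(w, 0.5)   # forward uses the pruned weights
loss = sparse_w.sum()
loss.backward()                     # backward updates the dense weights as if unpruned
print(sparse_w)
print(w.grad)
```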
- Communication Efficient Distributed Training with Distributed Lion [25.39333175634972]
We introduce Distributed Lion, an innovative adaptation of Lion for distributed training environments.
We demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems.
arXiv Detail & Related papers (2024-03-30T18:07:29Z)
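Distributed Lion's appeal, as the main abstract also notes, is that workers only need to exchange sign vectors. One minimal way to aggregate them is majority voting: each worker contributes its ±1 update and the aggregate keeps the sign of the coordinate-wise sum. The sketch below simulates several workers in-process; it illustrates the voting idea under my own assumptions and is not the exact Distributed Lion protocol.

```python
import torch

def majority_vote(sign_updates):
    """Aggregate per-worker sign updates by coordinate-wise majority voting."""
    votes = torch.stack(sign_updates).sum(dim=0)   # sum of {-1, +1} votes per coordinate
    return torch.sign(votes)                       # ties (sum == 0) resolve to 0, i.e. no update

# Simulate 5 workers that each computed a Lion-style sign update on their own batch.
torch.manual_seed(0)
workers = [torch.sign(torch.randn(10)) for _ in range(5)]
global_update = majority_vote(workers)
print(global_update)
```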
- FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models [56.21666819468249]
Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server.
We introduce FedComLoc, integrating practical and effective compression into Scaffnew to further enhance communication efficiency.
arXiv Detail & Related papers (2024-03-14T22:29:59Z)
- LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression [56.01900711954956]
We introduce LoCoDL, a communication-efficient algorithm that leverages the two popular and effective techniques of Local training, which reduces the communication frequency, and Compression, in which short bitstreams are sent instead of full-dimensional vectors of floats.
LoCoDL provably benefits from local training and compression and enjoys a doubly-accelerated communication complexity, with respect to the condition number of the functions and the model dimension, in the general heterogeneous regime with strongly convex functions.
arXiv Detail & Related papers (2024-03-07T09:22:50Z)
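The LoCoDL summary combines two ingredients: several local steps between communications, and compressing what is eventually sent. The loop below sketches that pattern for two simulated workers using top-k sparsification of the local model deltas; the compressor, step counts, and averaging rule are illustrative choices of mine, not LoCoDL's actual operators.

```python
import torch

def topk_compress(delta, k):
    """Keep only the k largest-magnitude entries of a vector (a generic compressor)."""
    out = torch.zeros_like(delta)
    idx = delta.abs().topk(k).indices
    out[idx] = delta[idx]
    return out

def local_round(x_global, grad_fn, local_steps=5, lr=0.1, k=2):
    """Run several local gradient steps, then return a compressed model delta."""
    x = x_global.clone()
    for _ in range(local_steps):
        x -= lr * grad_fn(x)
    return topk_compress(x - x_global, k)

# Two simulated workers with different quadratic objectives (heterogeneous data):
# worker i minimizes 0.5 * ||x - t_i||^2, whose gradient is (x - t_i).
targets = [torch.tensor([1.0, -2.0, 0.5, 3.0]), torch.tensor([0.5, -1.0, 1.5, 2.0])]
x = torch.zeros(4)
for _ in range(10):                                  # communication rounds
    deltas = [local_round(x, lambda v, t=t: v - t) for t in targets]
    x += torch.stack(deltas).mean(dim=0)             # server averages the compressed deltas
print(x)                                             # drifts toward the average of the targets
```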
- Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts [8.393403749426097]
Lion (Evolved Sign Momentum) has shown promising results in training large AI models.
It performs comparably or favorably to AdamW but with greater memory efficiency.
Our analysis is made possible by the development of a new Lyapunov function for the Lion updates.
arXiv Detail & Related papers (2023-10-09T17:41:29Z)
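For context, the Lion update that the Lyapunov analysis studies can be written, with gradient $g_t$, momentum $m_t$, learning rate $\eta$, and weight decay $\lambda$, as follows; this restates the standard Lion rule, and the constrained-optimization reading after the block paraphrases the abstract's claim rather than its formal statement.

```latex
\begin{aligned}
c_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
x_t &= x_{t-1} - \eta \left( \operatorname{sign}(c_t) + \lambda x_{t-1} \right), \\
m_t &= \beta_2 m_{t-1} + (1-\beta_2)\, g_t.
\end{aligned}
```

With weight decay $\lambda > 0$, the paper's Lyapunov argument interprets these iterates as solving the bound-constrained problem $\min_x f(x)$ subject to $\|x\|_\infty \le 1/\lambda$.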
- Predictive GAN-powered Multi-Objective Optimization for Hybrid Federated Split Learning [56.125720497163684]
We propose a hybrid federated split learning framework in wireless networks.
We design a parallel computing scheme for model splitting without label sharing, and theoretically analyze the influence of the delayed gradient caused by the scheme on the convergence speed.
arXiv Detail & Related papers (2022-09-02T10:29:56Z)
- Communication-Efficient Federated Learning via Predictive Coding [38.778944321534084]
Federated learning can enable remote workers to collaboratively train a shared machine learning model.
The communication overhead is a critical bottleneck due to limited power and bandwidth.
We propose a predictive coding based communication scheme for federated learning.
arXiv Detail & Related papers (2021-08-02T14:12:19Z)
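Predictive coding in this context generally means that sender and receiver predict the next model update from previously exchanged ones and only the (quantized) prediction residual is transmitted. The sketch below uses the simplest possible predictor, the previously reconstructed update, and coarse uniform quantization of the residual; it illustrates the general idea under my own assumptions rather than the scheme proposed in the paper.

```python
import torch

def quantize(x, step=0.05):
    """Uniform quantization of the residual (a coarse stand-in for an entropy-coded payload)."""
    return torch.round(x / step) * step

class PredictiveCoder:
    """Client and server keep the same predictor state, so only residuals travel."""
    def __init__(self, dim):
        self.prediction = torch.zeros(dim)        # predict: next update == last reconstruction

    def encode(self, update):
        residual = quantize(update - self.prediction)
        self.prediction = self.prediction + residual   # advance the shared state
        return residual                                # this is what gets transmitted

    def decode(self, residual):
        self.prediction = self.prediction + residual
        return self.prediction

torch.manual_seed(0)
client, server = PredictiveCoder(6), PredictiveCoder(6)
update = torch.randn(6)
for _ in range(3):                        # successive rounds with slowly drifting updates
    update = update + 0.01 * torch.randn(6)
    residual = client.encode(update)      # small residual -> few bits on the wire
    recovered = server.decode(residual)
print(update)
print(recovered)                          # recovered ~= update, up to quantization error
```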
- Accelerating Federated Edge Learning via Optimized Probabilistic Device Scheduling [57.271494741212166]
This paper formulates and solves the communication time minimization problem.
It is found that the optimized policy gradually turns its priority from suppressing the remaining communication rounds to reducing per-round latency as the training process evolves.
The effectiveness of the proposed scheme is demonstrated via a use case on collaborative 3D object detection in autonomous driving.
arXiv Detail & Related papers (2021-07-24T11:39:17Z)
- Communication-Efficient Split Learning Based on Analog Communication and Over the Air Aggregation [48.150466900765316]
Split-learning (SL) has recently gained popularity due to its inherent privacy-preserving capabilities and ability to enable collaborative inference for devices with limited computational power.
Standard SL algorithms assume an ideal underlying digital communication system and ignore the problem of scarce communication bandwidth.
We propose a novel SL framework to solve the remote inference problem that introduces an additional layer at the agent side and constrains the choices of the weights and the biases to ensure over the air aggregation.
arXiv Detail & Related papers (2021-06-02T07:49:41Z)
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
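Update quantization of the kind this summary describes maps each float to one of 2^b levels; adapting b over training trades communication volume against quantization error. The sketch below implements plain uniform b-bit quantization with a toy schedule that increases b in later rounds; the schedule and names are illustrative and not AdaFL's actual strategy.

```python
import torch

def quantize_update(update, num_bits):
    """Uniform b-bit quantization of a model update (symmetric, per-tensor scale)."""
    levels = 2 ** (num_bits - 1) - 1                  # e.g. 127 levels per side for 8 bits
    scale = update.abs().max().clamp(min=1e-12) / levels
    q = torch.round(update / scale).clamp(-levels, levels)
    return q.to(torch.int8), scale                    # integers plus one scale are sent per tensor

def dequantize(q, scale):
    return q.to(torch.float32) * scale

torch.manual_seed(0)
update = torch.randn(1000)
for round_idx in [1, 50, 100]:
    bits = 2 if round_idx < 50 else 8                 # toy schedule: few bits early, more later
    q, scale = quantize_update(update, bits)
    err = (dequantize(q, scale) - update).abs().mean()
    print(f"round {round_idx}: {bits}-bit, mean abs error {err:.4f}")
```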
- Distributed Sparse SGD with Majority Voting [5.32836690371986]
We introduce a majority voting based sparse communication strategy for distributed learning.
We show that it is possible to achieve up to 4000x compression without any loss in test accuracy.
arXiv Detail & Related papers (2020-11-12T17:06:36Z)
- FedAT: A High-Performance and Communication-Efficient Federated Learning System with Asynchronous Tiers [22.59875034596411]
We present FedAT, a novel Federated learning method with Asynchronous Tiers under Non-i.i.d. data.
FedAT minimizes the straggler effect with improved convergence speed and test accuracy.
Results show that FedAT improves the prediction performance by up to 21.09%, and reduces the communication cost by up to 8.5x, compared to state-of-the-art FL methods.
arXiv Detail & Related papers (2020-10-12T18:38:51Z)