Communication Efficient Distributed Training with Distributed Lion
- URL: http://arxiv.org/abs/2404.00438v1
- Date: Sat, 30 Mar 2024 18:07:29 GMT
- Title: Communication Efficient Distributed Training with Distributed Lion
- Authors: Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, Qiang Liu
- Abstract summary: We introduce Distributed Lion, an innovative adaptation of Lion for distributed training environments.
We demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems.
- Score: 25.39333175634972
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Lion optimizer has been a promising competitor to AdamW for training large AI models, with advantages in memory, computation, and sample efficiency. In this paper, we introduce Distributed Lion, an innovative adaptation of Lion for distributed training environments. Leveraging the sign operator in Lion, our Distributed Lion only requires communicating binary or lower-precision vectors between workers and the central server, significantly reducing the communication cost. Our theoretical analysis confirms Distributed Lion's convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, Distributed Lion attains comparable performance to standard Lion or AdamW optimizers applied on aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. In addition, we demonstrate that Distributed Lion presents a more favorable performance-bandwidth balance compared to existing efficient distributed methods such as deep gradient compression and ternary gradients.
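A minimal NumPy sketch of the communication pattern the abstract describes: each worker keeps its own Lion momentum, transmits only the sign of its local update, and the server combines the binary vectors before the result is applied to the parameters. The function names, hyperparameters, and the aggregation rules shown (majority vote and averaging) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def lion_worker_step(grad, momentum, beta1=0.9, beta2=0.99):
    """One worker's Lion step: return the binary (sign) update to be sent
    to the server and the updated local momentum, which never leaves the worker."""
    update = np.sign(beta1 * momentum + (1.0 - beta1) * grad)  # entries in {-1, 0, +1}
    momentum = beta2 * momentum + (1.0 - beta2) * grad
    return update, momentum

def server_aggregate(binary_updates, mode="majority"):
    """Combine the workers' binary vectors on the server.
    'majority' takes an element-wise majority vote (output stays in {-1, 0, +1});
    'average' returns the low-precision mean of the signs."""
    stacked = np.stack(binary_updates)            # shape: (num_workers, dim)
    if mode == "majority":
        return np.sign(stacked.sum(axis=0))
    return stacked.mean(axis=0)

def apply_update(params, agg_update, lr=1e-4, weight_decay=0.1):
    """Apply the aggregated update with Lion-style decoupled weight decay."""
    return params - lr * (agg_update + weight_decay * params)

# Toy round with 4 workers on a 10-dimensional parameter vector.
rng = np.random.default_rng(0)
params = rng.normal(size=10)
momenta = [np.zeros(10) for _ in range(4)]
grads = [rng.normal(size=10) for _ in range(4)]   # stand-ins for per-worker minibatch gradients

binary_updates = []
for i in range(4):
    u, momenta[i] = lion_worker_step(grads[i], momenta[i])
    binary_updates.append(u)                      # only these sign vectors are communicated

params = apply_update(params, server_aggregate(binary_updates, mode="majority"))
```

Because every transmitted entry lies in {-1, 0, +1}, each communication round costs on the order of one or two bits per parameter rather than 32, which is where the bandwidth savings over sending full-precision gradients come from.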
Related papers
- Over-the-Air Fair Federated Learning via Multi-Objective Optimization [52.295563400314094]
We propose an over-the-air fair federated learning algorithm (OTA-FFL) to train fair FL models.
Experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance.
arXiv Detail & Related papers (2025-01-06T21:16:51Z) - Lion Cub: Minimizing Communication Overhead in Distributed Lion [9.360174471655977]
Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects.
We analyze three factors critical to distributed learning with Lion: optimizing communication methods, identifying effective quantization methods, and assessing the necessity of momentum synchronization.
We combine these into Lion Cub, which enables up to 5x speedups in end-to-end training compared to Lion.
arXiv Detail & Related papers (2024-11-25T15:08:24Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to the information loss introduced by compression.
We propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality.
Experimental results show that across large-scale model training frameworks such as Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Sparse-ProxSkip: Accelerated Sparse-to-Sparse Training in Federated Learning [56.21666819468249]
In Federated Learning (FL), both client resource constraints and communication costs pose major problems for training large models.
Recent work has shown that local training provably improves communication complexity through acceleration.
We introduce Sparse-ProxSkip, which addresses this issue by incorporating the efficient Straight-Through Estimator pruning technique into sparse training.
arXiv Detail & Related papers (2024-05-31T05:21:12Z) - Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain only a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts [8.393403749426097]
Lion (Evolved Sign Momentum) has shown promising results in training large AI models.
It performs comparably or favorably to AdamW but with greater memory efficiency.
Our analysis is made possible by the development of a new Lyapunov function for the Lion updates.
arXiv Detail & Related papers (2023-10-09T17:41:29Z) - LION: Implicit Vision Prompt Tuning [95.71880718928439]
We propose an efficient vision model named impLicit vIsion prOmpt tuNing (LION).
LION is motivated by deep implicit models with stable memory costs for various complex tasks.
The performance obtained by our LION is promising on a wide range of datasets.
arXiv Detail & Related papers (2023-03-17T14:07:55Z) - Symbolic Discovery of Optimization Algorithms [132.62397077095787]
We use efficient search techniques to explore an infinite and sparse program space.
Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$.
Lion has been successfully deployed in production systems such as Google's search ads CTR model.
arXiv Detail & Related papers (2023-02-13T20:27:30Z) - Predictive GAN-powered Multi-Objective Optimization for Hybrid Federated Split Learning [56.125720497163684]
We propose a hybrid federated split learning framework in wireless networks.
We design a parallel computing scheme for model splitting without label sharing, and theoretically analyze the influence of the delayed gradient caused by the scheme on the convergence speed.
arXiv Detail & Related papers (2022-09-02T10:29:56Z) - Regularization via Adaptive Pairwise Label Smoothing [19.252319300590653]
This paper introduces a novel label smoothing technique called Pairwise Label Smoothing (PLS).
Unlike current LS methods, which typically require finding a global smoothing distribution mass through a cross-validation search, PLS automatically learns the distribution mass for each input pair during training.
We empirically show that PLS significantly outperforms LS and the baseline models, achieving up to 30% of relative classification error reduction.
arXiv Detail & Related papers (2020-12-02T22:08:10Z) - Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)