EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
- URL: http://arxiv.org/abs/2412.07210v2
- Date: Mon, 17 Feb 2025 02:57:12 GMT
- Title: EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
- Authors: Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha,
- Abstract summary: We propose EDiT, an innovative Efficient Distributed Training method that combines a tailored Local SGD approach with model sharding techniques to enhance large-scale training efficiency.
We also introduce A-EDiT, a fully asynchronous variant of EDiT that accommodates heterogeneous clusters.
Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training.
- Score: 4.514681046629978
- License:
- Abstract: Distributed training methods are crucial for large language models (LLMs). However, existing distributed training methods often suffer from communication bottlenecks, stragglers, and limited elasticity, particularly in heterogeneous or large-scale environments. Local SGD methods have been proposed to address these issues, but their effectiveness remains limited to small-scale training due to additional memory overhead and lack of concerns on efficiency and stability. To tackle these issues, we propose EDiT, an innovative Efficient Distributed Training method that combines a tailored Local SGD approach with model sharding techniques to enhance large-scale training efficiency. EDiT performs layer-wise parameter synchronization during forward pass, reducing communication and memory overhead and enabling overlap. Besides, EDiT employs a pseudo gradient penalty strategy to suppress loss spikes, which ensures training stability and improves performance. Additionally, we introduce A-EDiT, a fully asynchronous variant of EDiT that accommodates heterogeneous clusters. Building on EDiT/A-EDiT, we conduct a series of experiments to validate large-scale asynchronous training for LLMs, accompanied by comprehensive analyses. Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training in diverse computational ecosystems. The code is available at Atorch codebase: https://github.com/intelligent-machine-learning/atorch/tree/main/atorch/local_sgd.
Related papers
- DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - PFedDST: Personalized Federated Learning with Decentralized Selection Training [8.21688083335571]
We introduce the Personalized Federated Learning with Decentralized Selection Training (PFedDST) framework.
PFedDST enhances model training by allowing devices to strategically evaluate and select peers based on a comprehensive communication score.
Our experiments demonstrate that PFedDST not only enhances model accuracy but also accelerates convergence.
arXiv Detail & Related papers (2025-02-11T18:25:48Z) - Over-the-Air Fair Federated Learning via Multi-Objective Optimization [52.295563400314094]
We propose an over-the-air fair federated learning algorithm (OTA-FFL) to train fair FL models.
Experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance.
arXiv Detail & Related papers (2025-01-06T21:16:51Z) - A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning [3.0468273116892752]
This paper investigates various optimization techniques in distributed deep learning.
We propose a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure.
arXiv Detail & Related papers (2024-09-14T00:46:51Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low compression rate and high compression rate.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z) - FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup
for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specified auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z) - Accelerating Federated Edge Learning via Topology Optimization [41.830942005165625]
Federated edge learning (FEEL) is envisioned as a promising paradigm to achieve privacy-preserving distributed learning.
It consumes excessive learning time due to the existence of straggler devices.
A novel topology-optimized federated edge learning (TOFEL) scheme is proposed to tackle the heterogeneity issue in federated learning.
arXiv Detail & Related papers (2022-04-01T14:49:55Z) - Semi-Decentralized Federated Edge Learning with Data and Device
Heterogeneity [6.341508488542275]
Federated edge learning (FEEL) has attracted much attention as a privacy-preserving paradigm to effectively incorporate the distributed data at the network edge for training deep learning models.
In this paper, we investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL), where multiple edge servers are employed to collectively coordinate a large number of client nodes.
By exploiting the low-latency communication among edge servers for efficient model sharing, SD-FEEL can incorporate more training data, while enjoying much lower latency compared with conventional federated learning.
arXiv Detail & Related papers (2021-12-20T03:06:08Z) - Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data [14.269800282001464]
Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions.
We investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL)
By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency.
arXiv Detail & Related papers (2021-04-26T16:11:47Z) - Adaptive Serverless Learning [114.36410688552579]
We propose a novel adaptive decentralized training approach, which can compute the learning rate from data dynamically.
Our theoretical results reveal that the proposed algorithm can achieve linear speedup with respect to the number of workers.
To reduce the communication-efficient overhead, we further propose a communication-efficient adaptive decentralized training approach.
arXiv Detail & Related papers (2020-08-24T13:23:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.