EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
- URL: http://arxiv.org/abs/2412.07210v2
- Date: Mon, 17 Feb 2025 02:57:12 GMT
- Title: EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
- Authors: Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha,
- Abstract summary: We propose EDiT, an innovative Efficient Distributed Training method that combines a tailored Local SGD approach with model sharding techniques to enhance large-scale training efficiency. We also introduce A-EDiT, a fully asynchronous variant of EDiT that accommodates heterogeneous clusters. Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training.
- Score: 4.514681046629978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed training methods are crucial for large language models (LLMs). However, existing distributed training methods often suffer from communication bottlenecks, stragglers, and limited elasticity, particularly in heterogeneous or large-scale environments. Local SGD methods have been proposed to address these issues, but their effectiveness remains limited to small-scale training due to additional memory overhead and insufficient attention to efficiency and stability. To tackle these issues, we propose EDiT, an innovative Efficient Distributed Training method that combines a tailored Local SGD approach with model sharding techniques to enhance large-scale training efficiency. EDiT performs layer-wise parameter synchronization during the forward pass, which reduces communication and memory overhead and enables the overlap of communication with computation. In addition, EDiT employs a pseudo-gradient penalty strategy to suppress loss spikes, which ensures training stability and improves performance. We also introduce A-EDiT, a fully asynchronous variant of EDiT that accommodates heterogeneous clusters. Building on EDiT/A-EDiT, we conduct a series of experiments to validate large-scale asynchronous training for LLMs, accompanied by comprehensive analyses. Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training in diverse computational ecosystems. The code is available in the Atorch codebase: https://github.com/intelligent-machine-learning/atorch/tree/main/atorch/local_sgd.
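To make the Local SGD skeleton behind EDiT concrete, below is a minimal, self-contained sketch of the synchronous outer loop, with simple pseudo-gradient clipping standing in for the paper's pseudo-gradient penalty (the layer-wise forward-pass synchronization and the penalty's exact form are described in the paper itself). The quadratic losses, worker count, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
# Toy synchronous Local SGD loop: each worker takes several local steps,
# then the outer step applies the averaged (clipped) pseudo-gradient.
import numpy as np

rng = np.random.default_rng(0)
DIM, WORKERS, LOCAL_STEPS, ROUNDS = 10, 4, 8, 20
INNER_LR, OUTER_LR, CLIP_NORM = 0.05, 0.7, 1.0

# Each worker sees a slightly different quadratic: loss_k(x) = 0.5 * ||x - t_k||^2.
targets = rng.normal(size=(WORKERS, DIM))
x_global = np.zeros(DIM)

for r in range(ROUNDS):
    pseudo_grads = []
    for k in range(WORKERS):
        x = x_global.copy()
        for _ in range(LOCAL_STEPS):          # H local SGD steps per worker
            x -= INNER_LR * (x - targets[k])  # gradient of the local quadratic
        delta = x_global - x                  # pseudo-gradient: params before minus after
        norm = np.linalg.norm(delta)
        if norm > CLIP_NORM:                  # hypothetical stand-in for the penalty:
            delta *= CLIP_NORM / norm         # clip outlier pseudo-gradients
        pseudo_grads.append(delta)
    # Synchronous outer step on the averaged pseudo-gradient.
    x_global -= OUTER_LR * np.mean(pseudo_grads, axis=0)
    loss = 0.5 * np.mean([np.sum((x_global - t) ** 2) for t in targets])
    print(f"round {r:02d}  mean loss {loss:.4f}")
```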
Related papers
- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models [21.16132396642158]
Training stability is a persistent challenge in the pre-training of large language models (LLMs).
We propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers.
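As a rough illustration of what decoupling a weight matrix's scale from its distribution can look like, here is a weight-normalization-style linear layer; this is an assumption-laden stand-in, not SDD's exact parameterization.

```python
# Each row of the weight is a learned direction (normalized to unit norm,
# carrying only the distribution) times a separately learned scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.direction = nn.Parameter(torch.randn(out_features, in_features))
        self.scale = nn.Parameter(torch.ones(out_features))  # learned per-row scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        unit_rows = F.normalize(self.direction, dim=1)  # rows carry no scale
        return F.linear(x, unit_rows * self.scale.unsqueeze(1))

y = DecoupledLinear(16, 8)(torch.randn(4, 16))  # -> shape (4, 8)
```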
arXiv Detail & Related papers (2025-02-21T14:49:34Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
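For intuition, the snippet below sketches sigmoid gating with a straight-through estimator, the two routing ingredients named in the summary; the block partitioning, threshold, and dimensions are illustrative assumptions rather than DSMoE's implementation.

```python
# Straight-through routing: hard 0/1 gates in the forward pass,
# sigmoid gradients in the backward pass.
import torch

def ste_route(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    soft = torch.sigmoid(logits)
    hard = (soft > threshold).float()
    # Forward value equals `hard`; gradients flow through `soft`.
    return hard + soft - soft.detach()

logits = torch.randn(4, requires_grad=True)   # one routing logit per FFN block
gates = ste_route(logits)
blocks_out = torch.randn(4, 8)                # stand-in outputs of 4 hypothetical blocks
token_out = (gates.unsqueeze(1) * blocks_out).sum(dim=0)
token_out.sum().backward()                    # gradients reach `logits` via the STE
print(gates, logits.grad)
```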
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - PFedDST: Personalized Federated Learning with Decentralized Selection Training [8.21688083335571]
We introduce the Personalized Federated Learning with Decentralized Selection Training (PFedDST) framework.
PFedDST enhances model training by allowing devices to strategically evaluate and select peers based on a comprehensive communication score.
Our experiments demonstrate that PFedDST not only enhances model accuracy but also accelerates convergence.
arXiv Detail & Related papers (2025-02-11T18:25:48Z) - Over-the-Air Fair Federated Learning via Multi-Objective Optimization [52.295563400314094]
We propose an over-the-air fair federated learning algorithm (OTA-FFL) to train fair FL models.
Experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance.
arXiv Detail & Related papers (2025-01-06T21:16:51Z) - MelissaDL x Breed: Towards Data-Efficient On-line Supervised Training of Multi-parametric Surrogates with Active Learning [0.0]
We introduce a new active learning method to enhance data-efficiency for on-line surrogate training.
The surrogate is trained to predict a given timestep directly for different initial and boundary condition parameters.
Preliminary results on a 2D heat PDE demonstrate the potential of this method, called Breed, to improve the generalization capabilities of surrogates.
arXiv Detail & Related papers (2024-10-08T09:52:15Z) - A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning [3.0468273116892752]
This paper investigates various optimization techniques in distributed deep learning.
We propose a dynamic weighting strategy to mitigate the problem of straggler nodes due to failure.
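As a toy illustration of one way such reweighting can work, the sketch below drops workers that miss a deadline and renormalizes the weights over the survivors; the failure model and uniform weights are assumptions, not the paper's scheme.

```python
# Aggregate only gradients that arrived, renormalizing so the
# update magnitude stays comparable when workers fail.
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(8, 4))     # gradients from 8 hypothetical workers
arrived = rng.random(8) > 0.25      # ~25% of workers miss the deadline

if arrived.any():
    weights = arrived / arrived.sum()
    update = weights @ grads        # weighted average over surviving workers
else:
    update = np.zeros(4)            # skip the step if every worker failed
print(update)
```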
arXiv Detail & Related papers (2024-09-14T00:46:51Z) - Parallel Split Learning with Global Sampling [9.57839529462706]
Parallel split learning has emerged as a promising derivative of split learning, well suited for distributed learning on resource-constrained devices.
However, it faces challenges including large effective batch sizes, non-independent and identically distributed data, and the straggler effect.
We introduce a new method called uniform global sampling to decouple the effective batch size from the number of clients and reduce the mini-batch deviation.
Our simulations reveal that our proposed methods enhance model accuracy by up to 34.1% in non-independent and identically distributed settings and reduce the training time in the presence of stragglers by up to 62%.
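For intuition, a hedged sketch of the core idea follows: draw one fixed-size mini-batch uniformly over the union of all client data, so the effective batch size is decoupled from the number of clients. The client counts and sizes here are illustrative assumptions.

```python
# Uniform global sampling: pick BATCH indices over the pooled data,
# then map each back to a (client, local index) pair.
import numpy as np

rng = np.random.default_rng(2)
client_sizes = [120, 80, 200, 50]        # samples held by each hypothetical client
offsets = np.cumsum([0] + client_sizes)  # global index ranges per client
BATCH = 32                               # fixed effective batch size

global_idx = rng.choice(offsets[-1], size=BATCH, replace=False)
client_of = np.searchsorted(offsets, global_idx, side="right") - 1
local_idx = global_idx - offsets[client_of]
print(list(zip(client_of.tolist(), local_idx.tolist()))[:5])
```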
arXiv Detail & Related papers (2024-07-22T15:41:23Z) - Efficient Ensembles Improve Training Data Attribution [12.180392191924758]
Training data attribution methods aim to quantify the influence of individual data points on model predictions, with broad applications in data-centric AI.
Existing methods in this field, which can be categorized as retraining-based and gradient-based methods, have struggled with the trade-off between computational efficiency and attribution efficacy.
Recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution.
arXiv Detail & Related papers (2024-05-27T15:58:34Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low compression rate and high compression rate.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z) - FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specified auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
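For reference, a minimal AMSGrad-style step of the kind each FedLALR client would run with its own optimizer state is sketched below; the hyperparameters and toy objective are illustrative assumptions, not the paper's schedule.

```python
# One AMSGrad step: adaptive per-coordinate learning rates with a
# monotone max on the second moment. Each client keeps its own (m, v, vhat).
import numpy as np

def amsgrad_step(x, grad, m, v, vhat, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment
    v = b2 * v + (1 - b2) * grad ** 2      # second moment
    vhat = np.maximum(vhat, v)             # AMSGrad: non-decreasing denominator
    x = x - lr * m / (np.sqrt(vhat) + eps)
    return x, m, v, vhat

x = np.ones(4)
m = v = vhat = np.zeros(4)
for _ in range(100):
    grad = 2 * x                           # gradient of the toy loss ||x||^2
    x, m, v, vhat = amsgrad_step(x, grad, m, v, vhat)
print(x)  # moves toward the origin
```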
arXiv Detail & Related papers (2023-09-18T12:35:05Z) - Accelerating Federated Edge Learning via Topology Optimization [41.830942005165625]
Federated edge learning (FEEL) is envisioned as a promising paradigm to achieve privacy-preserving distributed learning.
However, it suffers from excessive learning time due to straggler devices.
A novel topology-optimized federated edge learning (TOFEL) scheme is proposed to tackle the heterogeneity issue in federated learning.
arXiv Detail & Related papers (2022-04-01T14:49:55Z) - Semi-Decentralized Federated Edge Learning with Data and Device Heterogeneity [6.341508488542275]
Federated edge learning (FEEL) has attracted much attention as a privacy-preserving paradigm to effectively incorporate the distributed data at the network edge for training deep learning models.
In this paper, we investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL), where multiple edge servers are employed to collectively coordinate a large number of client nodes.
By exploiting the low-latency communication among edge servers for efficient model sharing, SD-FEEL can incorporate more training data, while enjoying much lower latency compared with conventional federated learning.
arXiv Detail & Related papers (2021-12-20T03:06:08Z) - Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data [14.269800282001464]
Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions.
We investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL).
By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency.
arXiv Detail & Related papers (2021-04-26T16:11:47Z) - Adaptive Serverless Learning [114.36410688552579]
We propose a novel adaptive decentralized training approach, which can compute the learning rate from data dynamically.
Our theoretical results reveal that the proposed algorithm can achieve linear speedup with respect to the number of workers.
To reduce the communication overhead, we further propose a communication-efficient adaptive decentralized training approach.
arXiv Detail & Related papers (2020-08-24T13:23:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.