Related papers: Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

URL: http://arxiv.org/abs/2312.03549v4
Date: Mon, 29 Apr 2024 06:56:25 GMT
Title: Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Authors: Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Yuanyuan Wang, Fu Wu, Jiezhong Qiu, Aimin Pan,
Abstract summary: Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. Training these models can incur significant expenses, often requiring tens of thousands of GPU for months of continuous operation. We introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment.
Score: 8.30319294116657
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments that involved various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance levels close to those achievable with homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding training efficiency within the pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks under heterogeneous NIC environment in terms of training efficiency and can be seamlessly integrated with them.

Related papers

Efficient Split Learning LSTM Models for FPGA-based Edge IoT Devices [4.788487793976781]
Split Learning (SL) is an efficient paradigm for distributed Machine Learning (ML) suitable for the Internet Of Things (IoT)-Cloud systems. deploying SL on resource-constrained edge IoT platforms poses a significant challenge in terms of balancing the model performance against the processing, memory, and energy resources. We present a practical study of deploying SL framework on a real-world Field-Programmable Gate Array (FPGA)-based edge IoT platform.
arXiv Detail & Related papers (2025-02-12T15:51:39Z)
LCFed: An Efficient Clustered Federated Learning Framework for Heterogeneous Data [21.341280782748278]
Clustered federated learning (CFL) addresses the performance challenges posed by data heterogeneity in federated learning (FL) Existing CFL approaches strictly limit knowledge sharing to within clusters, lacking the integration of global knowledge with intra-cluster training. We propose LCFed, an efficient CFL framework to combat these challenges.
arXiv Detail & Related papers (2025-01-03T14:59:48Z)
WDMoE: Wireless Distributed Mixture of Experts for Large Language Models [68.45482959423323]
Large Language Models (LLMs) have achieved significant success in various natural language processing tasks. We propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks.
arXiv Detail & Related papers (2024-11-11T02:48:00Z)
ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
atom is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting. atom aims to accommodate a complete LLM on one host (peer) through seamlessly model swapping and concurrently trains multiple copies across various peers to optimize training throughput. Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, atom can enhance training efficiency up to $20 times$ when juxtaposed with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
MP-SL: Multihop Parallel Split Learning [2.7716102039510564]
Multihop Parallel SL (MP-SL) is a modular and Machine Learning as a Service (ML) framework designed to facilitate the involvement of resource-constrained devices. MP-SL supports multihop Parallel SL-based training. This involves splitting the model into multiple parts and utilizing multiple compute nodes in a pipelined manner.
arXiv Detail & Related papers (2024-01-31T22:09:40Z)
Efficient Implementation of a Multi-Layer Gradient-Free Online-Trainable Spiking Neural Network on FPGA [0.31498833540989407]
ODESA is the first network to have end-to-end multi-layer online local supervised training without using gradients. This research shows that the network architecture and the online training of weights and thresholds can be implemented efficiently on a large scale in hardware.
arXiv Detail & Related papers (2023-05-31T00:34:15Z)
Hierarchical Personalized Federated Learning Over Massive Mobile Edge Computing Networks [95.39148209543175]
We propose hierarchical PFL (HPFL), an algorithm for deploying PFL over massive MEC networks. HPFL combines the objectives of training loss minimization and round latency minimization while jointly determining the optimal bandwidth allocation.
arXiv Detail & Related papers (2023-03-19T06:00:05Z)
Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices. We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms. We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in a SSL setting. The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive. We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
Multi-Edge Server-Assisted Dynamic Federated Learning with an Optimized Floating Aggregation Point [51.47520726446029]
cooperative edge learning (CE-FL) is a distributed machine learning architecture. We model the processes taken during CE-FL, and conduct analytical training. We show the effectiveness of our framework with the data collected from a real-world testbed.
arXiv Detail & Related papers (2022-03-26T00:41:57Z)
Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data [14.269800282001464]
Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions. We investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL) By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency.
arXiv Detail & Related papers (2021-04-26T16:11:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.