Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- URL: http://arxiv.org/abs/2312.03549v4
- Date: Mon, 29 Apr 2024 06:56:25 GMT
- Title: Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- Authors: Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Yuanyuan Wang, Fu Wu, Jiezhong Qiu, Aimin Pan,
- Abstract summary: Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks.
Training these models can incur significant expenses, often requiring tens of thousands of GPU for months of continuous operation.
We introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment.
- Score: 8.30319294116657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments that involved various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance levels close to those achievable with homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding training efficiency within the pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks under heterogeneous NIC environment in terms of training efficiency and can be seamlessly integrated with them.
Related papers
- DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism [14.539699026008746]
Dynamic Hybrid Parallelism (DHP) is an efficient strategy that adaptively reconfigures communication groups and parallelism during MLLM training.<n>DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36 $times$ speedup in training throughput.
arXiv Detail & Related papers (2026-02-25T11:11:53Z) - Training Report of TeleChat3-MoE [77.94641922160359]
This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes.<n>We detail systematic methodologies for operator-level and end-to-end numerical verification accuracy, ensuring consistency across hardware platforms.<n>A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations.
arXiv Detail & Related papers (2025-12-30T11:42:14Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks.<n>In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks.<n>To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference [49.141930185079325]
We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions.<n>We demonstrate that ILP-based placement strategy yields lower network traffic than competitors for small-scale (DeepSeekMoE16B) and large-scale (DeepSeek-R1671B) models.
arXiv Detail & Related papers (2025-08-12T07:08:48Z) - Efficient Split Learning LSTM Models for FPGA-based Edge IoT Devices [4.788487793976781]
Split Learning (SL) is an efficient paradigm for distributed Machine Learning (ML) suitable for the Internet Of Things (IoT)-Cloud systems.
deploying SL on resource-constrained edge IoT platforms poses a significant challenge in terms of balancing the model performance against the processing, memory, and energy resources.
We present a practical study of deploying SL framework on a real-world Field-Programmable Gate Array (FPGA)-based edge IoT platform.
arXiv Detail & Related papers (2025-02-12T15:51:39Z) - LCFed: An Efficient Clustered Federated Learning Framework for Heterogeneous Data [21.341280782748278]
Clustered federated learning (CFL) addresses the performance challenges posed by data heterogeneity in federated learning (FL)
Existing CFL approaches strictly limit knowledge sharing to within clusters, lacking the integration of global knowledge with intra-cluster training.
We propose LCFed, an efficient CFL framework to combat these challenges.
arXiv Detail & Related papers (2025-01-03T14:59:48Z) - WDMoE: Wireless Distributed Mixture of Experts for Large Language Models [68.45482959423323]
Large Language Models (LLMs) have achieved significant success in various natural language processing tasks.
We propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks.
arXiv Detail & Related papers (2024-11-11T02:48:00Z) - ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
atom is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
atom aims to accommodate a complete LLM on one host (peer) through seamlessly model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, atom can enhance training efficiency up to $20 times$ when juxtaposed with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - MP-SL: Multihop Parallel Split Learning [2.7716102039510564]
Multihop Parallel SL (MP-SL) is a modular and Machine Learning as a Service (ML) framework designed to facilitate the involvement of resource-constrained devices.
MP-SL supports multihop Parallel SL-based training. This involves splitting the model into multiple parts and utilizing multiple compute nodes in a pipelined manner.
arXiv Detail & Related papers (2024-01-31T22:09:40Z) - Efficient Implementation of a Multi-Layer Gradient-Free Online-Trainable
Spiking Neural Network on FPGA [0.31498833540989407]
ODESA is the first network to have end-to-end multi-layer online local supervised training without using gradients.
This research shows that the network architecture and the online training of weights and thresholds can be implemented efficiently on a large scale in hardware.
arXiv Detail & Related papers (2023-05-31T00:34:15Z) - Hierarchical Personalized Federated Learning Over Massive Mobile Edge
Computing Networks [95.39148209543175]
We propose hierarchical PFL (HPFL), an algorithm for deploying PFL over massive MEC networks.
HPFL combines the objectives of training loss minimization and round latency minimization while jointly determining the optimal bandwidth allocation.
arXiv Detail & Related papers (2023-03-19T06:00:05Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - Unifying Synergies between Self-supervised Learning and Dynamic
Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in a SSL setting.
The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - Multi-Edge Server-Assisted Dynamic Federated Learning with an Optimized
Floating Aggregation Point [51.47520726446029]
cooperative edge learning (CE-FL) is a distributed machine learning architecture.
We model the processes taken during CE-FL, and conduct analytical training.
We show the effectiveness of our framework with the data collected from a real-world testbed.
arXiv Detail & Related papers (2022-03-26T00:41:57Z) - Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data [14.269800282001464]
Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions.
We investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL)
By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency.
arXiv Detail & Related papers (2021-04-26T16:11:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.