Moshpit SGD: Communication-Efficient Decentralized Training on
Heterogeneous Unreliable Devices
- URL: http://arxiv.org/abs/2103.03239v1
- Date: Thu, 4 Mar 2021 18:58:05 GMT
- Title: Moshpit SGD: Communication-Efficient Decentralized Training on
Heterogeneous Unreliable Devices
- Authors: Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady
Pekhimenko
- Abstract summary: Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes.
Running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters.
We propose Moshpit All-Reduce -- an iterative averaging protocol that exponentially converges to the global average.
- Score: 5.74369902800427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep neural networks on large datasets can often be accelerated by
using multiple compute nodes. This approach, known as distributed training, can
utilize hundreds of computers via specialized message-passing protocols such as
Ring All-Reduce. However, running these protocols at scale requires reliable
high-speed networking that is only available in dedicated clusters. In
contrast, many real-world applications, such as federated learning and
cloud-based distributed training, operate on unreliable devices with unstable
network bandwidth. As a result, these applications are restricted to using
parameter servers or gossip-based averaging protocols. In this work, we lift
that restriction by proposing Moshpit All-Reduce -- an iterative averaging
protocol that exponentially converges to the global average. We demonstrate the
efficiency of our protocol for distributed optimization with strong theoretical
guarantees. The experiments show 1.3x speedup for ResNet-50 training on
ImageNet compared to competitive gossip-based strategies and 1.5x speedup when
training ALBERT-large from scratch using preemptible compute nodes.
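To give an intuition for the iterative averaging at the core of the method, the following self-contained NumPy simulation is an idealized sketch, not the authors' implementation: it assumes a fault-free virtual 2D grid of workers and omits the DHT-based matchmaking, averaging parameters within one grid dimension per round.

```python
import numpy as np

# Idealized simulation of the group-averaging idea behind Moshpit All-Reduce:
# workers sit in a virtual 2D grid and average within one grid dimension per
# round. Grid shape and parameter size are illustrative; the real protocol
# forms groups dynamically and must tolerate worker failures.
rows, cols, dim = 8, 8, 4
rng = np.random.default_rng(0)
params = rng.normal(size=(rows, cols, dim))   # params[i, j] = vector held by worker (i, j)
global_avg = params.mean(axis=(0, 1))

for round_idx in range(2):
    axis = round_idx % 2                      # alternate the grouping dimension each round
    group_mean = params.mean(axis=axis, keepdims=True)
    params = np.broadcast_to(group_mean, params.shape).copy()

# With equal-sized groups and no failures, two rounds on a 2D grid reach the
# exact global average; the paper analyzes the realistic case with churn,
# where convergence to the average is exponentially fast.
assert np.allclose(params, global_avg)
```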
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- SpikePipe: Accelerated Training of Spiking Neural Networks via Inter-Layer Pipelining and Multiprocessor Scheduling [5.2831841848274985]
Training Spiking Neural Networks (SNNs) is computationally expensive compared to their conventional counterparts.
This is the first paper to propose inter-layer pipelining to accelerate training in SNNs using systolic array-based processors and multiprocessor scheduling.
arXiv Detail & Related papers (2024-06-11T01:43:45Z)
- DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
arXiv Detail & Related papers (2023-02-24T04:11:18Z)
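For context on the decoupling that DeAR exploits, an all-reduce is equivalent to a reduce-scatter followed by an all-gather; the toy NumPy check below is a generic illustration of that equivalence, not DeAR's code or its scheduling logic.

```python
import numpy as np

# Generic illustration (not DeAR's implementation): an all-reduce equals a
# reduce-scatter followed by an all-gather, the two primitives DeAR schedules
# separately so each can overlap with adjacent computation.
workers, chunk = 4, 3
rng = np.random.default_rng(1)
grads = rng.normal(size=(workers, workers * chunk))   # one gradient row per worker

# Reduce-scatter: worker k ends up owning the sum of chunk k across workers.
chunks = grads.reshape(workers, workers, chunk)       # [worker, chunk_id, values]
reduced = chunks.sum(axis=0)                          # reduced[k] = summed chunk k

# All-gather: every worker collects all reduced chunks.
allreduced = np.tile(reduced.reshape(-1), (workers, 1))

expected = np.repeat(grads.sum(axis=0, keepdims=True), workers, axis=0)
assert np.allclose(allreduced, expected)
```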
- Training Spiking Neural Networks with Local Tandem Learning [96.32026780517097]
Spiking neural networks (SNNs) are shown to be more biologically plausible and energy efficient than their predecessors.
In this paper, we put forward a generalized learning rule, termed Local Tandem Learning (LTL).
We demonstrate rapid network convergence within five training epochs on the CIFAR-10 dataset while having low computational complexity.
arXiv Detail & Related papers (2022-10-10T10:05:00Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach known as adversarial training (AT) has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
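For readers unfamiliar with adversarial training itself, the toy single-machine step below (logistic regression with an FGSM-style inner maximization, purely illustrative and unrelated to the paper's distributed large-batch framework) shows the perturb-then-update pattern that the paper scales out across machines.

```python
import numpy as np

# Toy single-machine adversarial-training (AT) step on logistic regression:
# perturb the inputs in the direction that increases the loss (FGSM-style
# inner maximization), then update the weights on the perturbed batch
# (outer minimization). The paper's contribution -- scaling such steps to
# large batches over many machines -- is not reproduced here.
rng = np.random.default_rng(2)
X = rng.normal(size=(32, 10))
y = rng.integers(0, 2, size=32).astype(float)
w = 0.1 * rng.normal(size=10)
eps, lr = 0.1, 0.5

def grads(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
    g_w = X.T @ (p - y) / len(y)       # loss gradient w.r.t. weights
    g_x = np.outer(p - y, w)           # per-sample loss gradient w.r.t. inputs
    return g_w, g_x

_, g_x = grads(w, X, y)
X_adv = X + eps * np.sign(g_x)         # inner maximization (FGSM)
g_w, _ = grads(w, X_adv, y)
w -= lr * g_w                          # outer minimization on the perturbed batch
```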
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity [36.254527362066725]
This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth.
We show that with reduced communication, due to sparsity, we can train close to a billion parameter model on simple 4-16 core CPU nodes connected by basic low bandwidth interconnect.
arXiv Detail & Related papers (2022-01-29T21:37:34Z)
- Collaborative Learning over Wireless Networks: An Introductory Overview [84.09366153693361]
We will mainly focus on collaborative training across wireless devices.
Many distributed optimization algorithms have been developed over the last decades.
They provide data locality; that is, a joint model can be trained collaboratively while the data available at each participating device remains local.
arXiv Detail & Related papers (2021-12-07T20:15:39Z)
- Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters [30.4449309904155]
We propose a new top-k sparsification communication library for distributed training.
We show that our system is 25%-40% faster than existing state-of-the-art systems when training CNNs and Transformers.
arXiv Detail & Related papers (2020-10-20T17:16:29Z)
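As a reminder of the underlying technique, top-k sparsification transmits only the k largest-magnitude gradient entries together with their indices; the helpers below are an illustrative sketch, not the paper's library or its API.

```python
import numpy as np

# Generic top-k gradient sparsification (illustrative helper names, not the
# paper's library): each worker sends only the k largest-magnitude entries
# of its gradient plus their indices, and the receiver rebuilds a dense
# approximation with zeros elsewhere.
def topk_sparsify(grad, k):
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest |values|
    return idx, flat[idx]

def densify(idx, vals, shape):
    out = np.zeros(int(np.prod(shape)))
    out[idx] = vals
    return out.reshape(shape)

grad = np.random.default_rng(3).normal(size=(4, 256))
idx, vals = topk_sparsify(grad, k=64)              # send ~6% of the entries
approx = densify(idx, vals, grad.shape)
print("kept fraction:", vals.size / grad.size)
```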
- Is Network the Bottleneck of Distributed Training? [36.925680383195356]
We take a first-principles approach to measure and analyze the network performance of distributed training.
We find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one.
arXiv Detail & Related papers (2020-06-17T19:00:31Z)
- Neural Network Compression Framework for fast model inference [59.65531492759006]
We present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF).
It leverages recent advances of various network compression methods and implements some of them, such as sparsity, quantization, and binarization.
The framework can be used through the sample training pipelines supplied with it, or as a standalone package that can be seamlessly integrated into existing training code.
arXiv Detail & Related papers (2020-02-20T11:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.