Distributed SLIDE: Enabling Training Large Neural Networks on Low
Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
- URL: http://arxiv.org/abs/2201.12667v1
- Date: Sat, 29 Jan 2022 21:37:34 GMT
- Title: Distributed SLIDE: Enabling Training Large Neural Networks on Low
Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
- Authors: Minghao Yan, Nicholas Meisburger, Tharun Medini, Anshumali Shrivastava
- Abstract summary: This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth.
We show that, with the communication reduction afforded by sparsity, we can train a model with close to a billion parameters on simple 4-16 core CPU nodes connected by basic low-bandwidth interconnect.
- Score: 36.254527362066725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: More than 70% of cloud computing is paid for but sits idle. A large
fraction of this idle compute consists of cheap CPUs with few cores that go
unused during less busy hours. This paper aims to put those CPU cycles to work
training heavyweight AI models. Our goal runs counter to that of mainstream
frameworks, which focus on leveraging expensive, specialized, ultra-high-bandwidth
interconnects to address the communication bottleneck in distributed neural network training.
This paper presents a distributed model-parallel training framework that
enables training large neural networks on small CPU clusters with low Internet
bandwidth. We build upon the adaptive sparse training framework introduced by
the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we
demonstrate several orders of magnitude faster model parallel training than
Horovod, the main engine behind most commercial software. We show that, with the
communication reduction afforded by sparsity, we can train a model with close to
a billion parameters on simple 4-16 core CPU nodes connected by basic
low-bandwidth interconnect. Moreover, the training time is on par with some of
the best hardware accelerators.
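To make the role of sparsity concrete, the sketch below illustrates the kind of LSH-driven neuron selection that SLIDE-style adaptive sparse training relies on. It is a minimal, hypothetical Python sketch, not the paper's implementation: it uses a SimHash-style random-projection hash (SLIDE's actual hashing scheme differs), and names such as `SparseLayer` and `active_neurons` are invented. Each input is hashed to a bucket, only the neurons whose weight rows fall in that bucket are evaluated, and in a distributed model-parallel setting only those neurons' activations and gradients would need to cross the network.

```python
import numpy as np

class SparseLayer:
    """Toy SLIDE-style layer: a SimHash table selects a small set of active
    neurons per input, so forward work (and any cross-node traffic) touches
    only that set. Illustrative sketch, not the authors' implementation."""

    def __init__(self, in_dim, out_dim, num_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.01
        self.b = np.zeros(out_dim)
        # Random hyperplanes for SimHash: each neuron's weight row gets a signature.
        self.planes = rng.standard_normal((num_bits, in_dim))
        self.buckets = {}  # signature -> list of neuron ids
        for j in range(out_dim):
            self.buckets.setdefault(self._sig(self.W[j]), []).append(j)

    def _sig(self, v):
        return tuple((self.planes @ v > 0).astype(np.int8))

    def active_neurons(self, x):
        """Neurons whose weight rows hash to the same bucket as the input."""
        return self.buckets.get(self._sig(x), [])

    def forward(self, x):
        act = self.active_neurons(x)
        # Only the selected rows of W are touched; in a model-parallel setting,
        # only these rows' activations/gradients would be communicated.
        return {j: float(self.W[j] @ x + self.b[j]) for j in act}

layer = SparseLayer(in_dim=128, out_dim=10_000)
x = np.random.default_rng(1).standard_normal(128)
print(f"active neurons: {len(layer.active_neurons(x))} of 10,000")
```

With 8 hash bits a typical input lands in a bucket holding on the order of tens of the 10,000 neurons (well under 1%), which is the kind of reduction that keeps per-step communication small enough for commodity interconnects.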
Related papers
- Distributed Convolutional Neural Network Training on Mobile and Edge Clusters [0.9421843976231371]
Recent efforts have emerged to localize machine learning tasks fully on the edge.
This brings advantages in reduced latency and increased privacy, but necessitates working with resource-constrained devices.
We describe an approach for distributed CNN training exclusively on mobile and edge devices.
arXiv Detail & Related papers (2024-09-11T02:44:28Z)
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that performs better than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices [5.74369902800427]
Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes.
Running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters.
We propose Moshpit All-Reduce -- an iterative averaging protocol that exponentially converges to the global average.
arXiv Detail & Related papers (2021-03-04T18:58:05Z)
- ItNet: iterative neural networks with small graphs for accurate and efficient anytime prediction [1.52292571922932]
In this study, we introduce a class of network models that have a small memory footprint in terms of their computational graphs.
We show state-of-the-art results for semantic segmentation on the CamVid and Cityscapes datasets.
arXiv Detail & Related papers (2021-01-21T15:56:29Z)
- Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters [30.4449309904155]
We propose a new top-k sparsification communication library for distributed training.
We show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformers (a generic sketch of top-k gradient compression appears after this list).
arXiv Detail & Related papers (2020-10-20T17:16:29Z)
- Neural Network Compression Framework for fast model inference [59.65531492759006]
We present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF).
It leverages recent advances in various network compression methods and implements some of them, such as sparsity, quantization, and binarization.
The framework can be used within the training samples supplied with it, or as a standalone package that can be seamlessly integrated into existing training code.
arXiv Detail & Related papers (2020-02-20T11:24:01Z)
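The top-k sparsification mentioned in the "Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters" entry above can be sketched in a few lines. This is a generic, hypothetical illustration of top-k gradient compression with error feedback, not that paper's library: each worker sends only the k largest-magnitude gradient entries as index/value pairs and folds the unsent remainder into the next step.

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx, vals, size):
    """Rebuild a dense gradient from the sparse (indices, values) message."""
    out = np.zeros(size)
    out[idx] = vals
    return out

def sparsify_step(grad, residual, k):
    """One worker-side compression step with error feedback: entries that are
    not sent this round are carried over (in `residual`) to the next round."""
    corrected = grad + residual
    idx, vals = topk_compress(corrected, k)
    new_residual = corrected.copy()
    new_residual[idx] = 0.0            # what is sent leaves the residual
    return idx, vals, new_residual

rng = np.random.default_rng(0)
grad = rng.standard_normal(1_000_000)
idx, vals, residual = sparsify_step(grad, np.zeros_like(grad), k=1_000)
print(f"sent {len(vals)} of 1,000,000 gradient entries")  # ~0.1% on the wire
```

Sending roughly 0.1% of the gradient per step is what makes this style of compression attractive on bandwidth-limited public-cloud links, at the cost of extra compression work and a slightly noisier update.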
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.