Distributed SLIDE: Enabling Training Large Neural Networks on Low
Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
- URL: http://arxiv.org/abs/2201.12667v1
- Date: Sat, 29 Jan 2022 21:37:34 GMT
- Title: Distributed SLIDE: Enabling Training Large Neural Networks on Low
Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
- Authors: Minghao Yan, Nicholas Meisburger, Tharun Medini, Anshumali Shrivastava
- Abstract summary: This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth.
We show that, with the communication reduction afforded by sparsity, we can train a model with close to a billion parameters on simple 4-16 core CPU nodes connected by basic low-bandwidth interconnect.
- Score: 36.254527362066725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: More than 70% of cloud computing is paid for but sits idle. A large
fraction of this idle compute consists of cheap CPUs with few cores that go
unused during less busy hours. This paper aims to put those CPU cycles to work
training heavyweight AI models. Our goal runs counter to that of mainstream
frameworks, which focus on leveraging expensive, specialized, ultra-high-bandwidth
interconnects to address the communication bottleneck in distributed neural network training.
This paper presents a distributed model-parallel training framework that
enables training large neural networks on small CPU clusters with low Internet
bandwidth. We build upon the adaptive sparse training framework introduced by
the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we
demonstrate several orders of magnitude faster model parallel training than
Horovod, the main engine behind most commercial software. We show that, with the
communication reduction afforded by sparsity, we can train a model with close to
a billion parameters on simple 4-16 core CPU nodes connected by basic
low-bandwidth interconnect. Moreover, the training time is on par with some of
the best hardware accelerators.
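To make the role of sparsity concrete, the sketch below illustrates the kind of LSH-driven neuron selection that SLIDE-style adaptive sparse training relies on. It is a minimal, hypothetical Python sketch, not the paper's implementation: it uses a SimHash-style random-projection hash (SLIDE's actual hashing scheme differs), and names such as `SparseLayer` and `active_neurons` are invented. Each input is hashed to a bucket, only the neurons whose weight rows fall in that bucket are evaluated, and in a distributed model-parallel setting only those neurons' activations and gradients would need to cross the network.

```python
import numpy as np

class SparseLayer:
    """Toy SLIDE-style layer: a SimHash table selects a small set of active
    neurons per input, so forward work (and any cross-node traffic) touches
    only that set. Illustrative sketch, not the authors' implementation."""

    def __init__(self, in_dim, out_dim, num_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.01
        self.b = np.zeros(out_dim)
        # Random hyperplanes for SimHash: each neuron's weight row gets a signature.
        self.planes = rng.standard_normal((num_bits, in_dim))
        self.buckets = {}  # signature -> list of neuron ids
        for j in range(out_dim):
            self.buckets.setdefault(self._sig(self.W[j]), []).append(j)

    def _sig(self, v):
        return tuple((self.planes @ v > 0).astype(np.int8))

    def active_neurons(self, x):
        """Neurons whose weight rows hash to the same bucket as the input."""
        return self.buckets.get(self._sig(x), [])

    def forward(self, x):
        act = self.active_neurons(x)
        # Only the selected rows of W are touched; in a model-parallel setting,
        # only these rows' activations/gradients would be communicated.
        return {j: float(self.W[j] @ x + self.b[j]) for j in act}

layer = SparseLayer(in_dim=128, out_dim=10_000)
x = np.random.default_rng(1).standard_normal(128)
print(f"active neurons: {len(layer.active_neurons(x))} of 10,000")
```

With 8 hash bits a typical input lands in a bucket holding on the order of tens of the 10,000 neurons (well under 1%), which is the kind of reduction that keeps per-step communication small enough for commodity interconnects.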
Related papers
- Distributed Convolutional Neural Network Training on Mobile and Edge Clusters [0.9421843976231371]
Recent efforts have emerged to localize machine learning tasks fully on the edge.
This brings advantages in reduced latency and increased privacy, but necessitates working with resource-constrained devices.
We describe an approach for distributed CNN training exclusively on mobile and edge devices.
arXiv Detail & Related papers (2024-09-11T02:44:28Z)
- Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
- Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that performs better than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices [5.74369902800427]
Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes.
Running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters.
We propose Moshpit All-Reduce -- an iterative averaging protocol that exponentially converges to the global average.
arXiv Detail & Related papers (2021-03-04T18:58:05Z)
- ItNet: iterative neural networks with small graphs for accurate and efficient anytime prediction [1.52292571922932]
In this study, we introduce a class of network models that have a small memory footprint in terms of their computational graphs.
We show state-of-the-art results for semantic segmentation on the CamVid and Cityscapes datasets.
arXiv Detail & Related papers (2021-01-21T15:56:29Z)
- Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters [30.4449309904155]
We propose a new top-k sparsification communication library for distributed training.
We show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformers (a generic sketch of top-k gradient compression appears after this list).
arXiv Detail & Related papers (2020-10-20T17:16:29Z)
- Neural Network Compression Framework for fast model inference [59.65531492759006]
We present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF).
It leverages recent advances in various network compression methods and implements some of them, such as sparsity, quantization, and binarization.
The framework can be used within the training samples supplied with it, or as a standalone package that can be seamlessly integrated into existing training code.
arXiv Detail & Related papers (2020-02-20T11:24:01Z)
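The top-k sparsification mentioned in the "Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters" entry above can be sketched in a few lines. This is a generic, hypothetical illustration of top-k gradient compression with error feedback, not that paper's library: each worker sends only the k largest-magnitude gradient entries as index/value pairs and folds the unsent remainder into the next step.

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries; return (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx, vals, size):
    """Rebuild a dense gradient from the sparse (indices, values) message."""
    out = np.zeros(size)
    out[idx] = vals
    return out

def sparsify_step(grad, residual, k):
    """One worker-side compression step with error feedback: entries that are
    not sent this round are carried over (in `residual`) to the next round."""
    corrected = grad + residual
    idx, vals = topk_compress(corrected, k)
    new_residual = corrected.copy()
    new_residual[idx] = 0.0            # what is sent leaves the residual
    return idx, vals, new_residual

rng = np.random.default_rng(0)
grad = rng.standard_normal(1_000_000)
idx, vals, residual = sparsify_step(grad, np.zeros_like(grad), k=1_000)
print(f"sent {len(vals)} of 1,000,000 gradient entries")  # ~0.1% on the wire
```

Sending roughly 0.1% of the gradient per step is what makes this style of compression attractive on bandwidth-limited public-cloud links, at the cost of extra compression work and a slightly noisier update.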
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.