Towards Scalable Distributed Training of Deep Learning on Public Cloud
Clusters
- URL: http://arxiv.org/abs/2010.10458v1
- Date: Tue, 20 Oct 2020 17:16:29 GMT
- Title: Towards Scalable Distributed Training of Deep Learning on Public Cloud
Clusters
- Authors: Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue
Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin
Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng,
Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu and Xiaowen Chu
- Abstract summary: We propose a new top-k sparsification communication library for distributed training.
We show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer.
- Score: 30.4449309904155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed training techniques have been widely deployed in large-scale deep
neural networks (DNNs) training on dense-GPU clusters. However, on public cloud
clusters, due to the moderate inter-connection bandwidth between instances,
traditional state-of-the-art distributed training systems cannot scale well in
training large-scale models. In this paper, we propose a new computing and
communication efficient top-k sparsification communication library for
distributed training. To further improve the system scalability, we optimize
I/O by proposing a simple yet efficient multi-level data caching mechanism and
optimize the update operation by introducing a novel parallel tensor operator.
Experimental results on a 16-node Tencent Cloud cluster (each node with 8
Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than
existing state-of-the-art systems on CNNs and Transformer. We finally break the
DAWNBench record by training ResNet-50 to 93% top-5 accuracy on ImageNet.
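The core mechanism behind such a top-k sparsification library can be sketched in a few lines (a minimal illustration assuming a flattened gradient and an error-feedback residual; the function below is not the authors' actual implementation): each worker transmits only the k largest-magnitude gradient entries and their indices, while the untransmitted remainder is accumulated locally and added back before the next selection.

```python
import numpy as np

def topk_sparsify(grad, k, residual):
    """Select the k largest-magnitude entries of (grad + residual).

    Returns (indices, values, new_residual); only indices and values would be
    communicated. The residual keeps the dropped entries (error feedback).
    """
    acc = grad + residual
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # top-k by magnitude, O(n)
    values = acc[idx]
    new_residual = acc.copy()
    new_residual[idx] = 0.0                       # transmitted entries leave the residual
    return idx, values, new_residual

# toy usage: one flattened gradient per worker, ~1% density
rng = np.random.default_rng(0)
grad = rng.standard_normal(1_000_000).astype(np.float32)
residual = np.zeros_like(grad)
idx, vals, residual = topk_sparsify(grad, k=10_000, residual=residual)
print(idx.shape, vals.shape)   # only these index/value pairs cross the network
```

In a full system the selected (index, value) pairs from all workers are then exchanged with a sparse collective operation, which is the part the proposed communication library optimizes.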
Related papers
- CDFGNN: a Systematic Design of Cache-based Distributed Full-Batch Graph Neural Network Training with Communication Reduction [7.048300785744331]
Graph neural network training is mainly categorized into mini-batch and full-batch training methods.
In the distributed cluster, frequent remote accesses of features and gradients lead to huge communication overhead.
We introduce the cache-based distributed full-batch graph neural network training framework (CDFGNN).
Our results indicate that CDFGNN has great potential in accelerating distributed full-batch GNN training tasks.
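A toy sketch of the caching idea (the class, the fill-until-full policy, and the hit counters are illustrative assumptions, not CDFGNN's actual design): features owned by remote partitions are fetched once and then served locally, so repeated accesses during full-batch training stop generating network traffic.

```python
class RemoteFeatureCache:
    """Toy local cache for vertex features that live on other partitions."""

    def __init__(self, remote_fetch, capacity=100_000):
        self.remote_fetch = remote_fetch   # callable: vertex_id -> feature vector
        self.capacity = capacity
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, vertex_id):
        if vertex_id in self.store:          # served locally, no communication
            self.hits += 1
            return self.store[vertex_id]
        self.misses += 1
        feat = self.remote_fetch(vertex_id)  # the expensive remote access
        if len(self.store) < self.capacity:  # naive fill-until-full policy
            self.store[vertex_id] = feat
        return feat

# usage with a dictionary standing in for a remote partition
remote_features = {v: [0.1 * v] * 8 for v in range(1_000)}
cache = RemoteFeatureCache(remote_features.__getitem__)
for v in [3, 7, 3, 3, 7]:
    cache.get(v)
print(cache.hits, cache.misses)   # 3 hits, 2 misses
```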
arXiv Detail & Related papers (2024-08-01T01:57:09Z) - Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
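As a rough illustration of complexity-based pruning (the mean-distance-to-centroid proxy and the keep-ratio rule below are assumptions, not the paper's exact criterion), one can retain more samples from clusters whose embeddings are spread out and fewer from tight, redundant ones.

```python
import numpy as np

def prune_by_cluster_complexity(embeddings, cluster_ids, min_keep=0.1, max_keep=0.9):
    """Keep a per-cluster fraction of samples that grows with cluster 'complexity'.

    Complexity is proxied by the mean distance to the cluster centroid,
    rescaled across clusters; returns the indices of the retained samples.
    """
    clusters = np.unique(cluster_ids)
    spreads = {}
    for c in clusters:
        members = np.where(cluster_ids == c)[0]
        feats = embeddings[members]
        spreads[c] = np.linalg.norm(feats - feats.mean(axis=0), axis=1).mean()
    lo, hi = min(spreads.values()), max(spreads.values())
    keep = []
    for c in clusters:
        members = np.where(cluster_ids == c)[0]
        t = 0.5 if hi == lo else (spreads[c] - lo) / (hi - lo)   # 0 = simplest cluster
        ratio = min_keep + t * (max_keep - min_keep)
        n_keep = max(1, int(ratio * len(members)))
        keep.extend(np.random.choice(members, n_keep, replace=False))
    return np.sort(np.array(keep))

# toy usage: 10k fake embeddings already assigned to 50 concept clusters
emb = np.random.randn(10_000, 64).astype(np.float32)
cids = np.random.randint(0, 50, size=10_000)
kept = prune_by_cluster_complexity(emb, cids)
print(f"kept {len(kept)} of {len(emb)} samples")
```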
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
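A simplified sketch of the vertex-cut idea behind communication-free training (the round-robin edge assignment and data layout are placeholders, not CoFree-GNN's actual partitioner): edges are assigned to partitions and the endpoints of cut edges are duplicated, so each partition can aggregate its neighborhoods without contacting the others.

```python
from collections import defaultdict

def vertex_cut_partition(edges, num_parts):
    """Assign each edge to a partition and duplicate endpoints as needed.

    Returns per-partition edge lists and vertex sets; duplicated vertices are
    those that appear in more than one partition.
    """
    part_edges = defaultdict(list)
    part_vertices = defaultdict(set)
    for i, (u, v) in enumerate(edges):
        p = i % num_parts                 # toy round-robin; real systems balance load
        part_edges[p].append((u, v))
        part_vertices[p].update((u, v))   # endpoints are replicated locally
    return part_edges, part_vertices

# usage on a tiny graph
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
pe, pv = vertex_cut_partition(edges, num_parts=2)
duplicated = pv[0] & pv[1]
print(dict(pe))      # edges owned by each partition
print(duplicated)    # vertices whose copies must be kept consistent
```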
arXiv Detail & Related papers (2023-08-06T21:04:58Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z) - Distributed SLIDE: Enabling Training Large Neural Networks on Low
Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity [36.254527362066725]
This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth.
We show that, with the communication reduced by sparsity, we can train a close-to-one-billion-parameter model on simple 4-16 core CPU nodes connected by a basic low-bandwidth interconnect.
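The sparsity that makes this possible comes from locality-sensitive hashing in the SLIDE algorithm; the sketch below (a simplified sign-random-projection hash, not the authors' implementation) shows how an input can be hashed to a small bucket of neurons so that only those neurons' weights need to be computed and synchronized.

```python
import numpy as np

class LSHNeuronSelector:
    """Toy sign-random-projection LSH mapping inputs to a small set of active neurons."""

    def __init__(self, dim, num_neurons, num_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_bits, dim))
        # bucket each neuron's weight vector once, up front
        self.neuron_weights = rng.standard_normal((num_neurons, dim))
        codes = (self.neuron_weights @ self.planes.T > 0).astype(np.uint8)
        self.buckets = {}
        for n, code in enumerate(map(tuple, codes)):
            self.buckets.setdefault(code, []).append(n)

    def active_neurons(self, x):
        code = tuple((self.planes @ x > 0).astype(np.uint8))
        return self.buckets.get(code, [])    # neurons colliding with the input

sel = LSHNeuronSelector(dim=128, num_neurons=10_000)
x = np.random.randn(128)
active = sel.active_neurons(x)
print(f"{len(active)} of 10000 neurons activated")   # typically a small fraction
```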
arXiv Detail & Related papers (2022-01-29T21:37:34Z) - Federated Dynamic Sparse Training: Computing Less, Communicating Less,
Yet Learning Better [88.28293442298015]
Federated learning (FL) enables distribution of machine learning workloads from the cloud to resource-limited edge devices.
We develop, implement, and experimentally validate a novel FL framework termed Federated Dynamic Sparse Training (FedDST).
FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network.
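A bare-bones sketch of dynamic sparse training (the magnitude-prune / random-regrow rule is a generic illustration, not necessarily FedDST's exact schedule): each client trains only the weights selected by a binary mask, and the mask is periodically adjusted by dropping small weights and regrowing elsewhere.

```python
import numpy as np

def make_mask(weights, density):
    """Binary mask keeping the largest-magnitude fraction `density` of weights."""
    k = max(1, int(density * weights.size))
    thresh = np.partition(np.abs(weights).ravel(), -k)[-k]
    return (np.abs(weights) >= thresh).astype(np.float32)

def adjust_mask(weights, mask, regrow_frac=0.1, rng=np.random.default_rng(0)):
    """Drop the smallest active weights and regrow the same number at random positions."""
    active = np.flatnonzero(mask)
    n_swap = max(1, int(regrow_frac * active.size))
    drop = active[np.argsort(np.abs(weights.ravel()[active]))[:n_swap]]
    inactive = np.flatnonzero(mask == 0)
    grow = rng.choice(inactive, size=min(n_swap, inactive.size), replace=False)
    new_mask = mask.copy().ravel()
    new_mask[drop] = 0.0
    new_mask[grow] = 1.0
    return new_mask.reshape(mask.shape)

# usage: a 20%-dense layer whose sparse pattern evolves during training
w = np.random.randn(256, 128).astype(np.float32)
mask = make_mask(w, density=0.2)
w *= mask                      # only masked-in weights are trained / communicated
mask = adjust_mask(w, mask)
print(mask.mean())             # density stays ~0.2
```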
arXiv Detail & Related papers (2021-12-18T02:26:38Z) - Efficient deep learning models for land cover image classification [0.29748898344267777]
This work experiments with the BigEarthNet dataset for land use land cover (LULC) image classification.
We benchmark different state-of-the-art models, including Convolutional Neural Networks, Multi-Layer Perceptrons, Visual Transformers, EfficientNets, and Wide Residual Networks (WRNs).
Our proposed lightweight model has an order of magnitude fewer trainable parameters, achieves a 4.5% higher averaged F-score across all 19 LULC classes, and trains two times faster than the state-of-the-art ResNet50 model we use as a baseline.
arXiv Detail & Related papers (2021-11-18T00:03:14Z) - Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model with better performance than training the low-rank model alone.
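One possible parameterization that matches this description (purely illustrative; the paper's actual construction may differ): each weight matrix is the sum of a low-rank core and a full-rank remainder, the full model uses the sum, and the core alone can be split off as a compact model after joint training.

```python
import numpy as np

class LowRankCoreLinear:
    """Toy linear layer with a low-rank 'core' (U @ V) plus a full-rank remainder R."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.standard_normal((d_out, rank)) * 0.01
        self.V = rng.standard_normal((rank, d_in)) * 0.01
        self.R = rng.standard_normal((d_out, d_in)) * 0.01

    def forward_full(self, x):
        return (self.U @ self.V + self.R) @ x    # used by the full model

    def forward_core(self, x):
        return self.U @ self.V @ x               # the split-off low-rank model

layer = LowRankCoreLinear(d_in=64, d_out=32, rank=4)
x = np.random.randn(64)
print(layer.forward_full(x).shape, layer.forward_core(x).shape)
# During training one would optimize a combined loss so that forward_core
# alone remains a strong predictor once the remainder R is discarded.
```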
arXiv Detail & Related papers (2021-06-16T15:57:51Z) - Accelerating Neural Network Training with Distributed Asynchronous and
Selective Optimization (DASO) [0.0]
We introduce the Distributed Asynchronous and Selective Optimization (DASO) method to accelerate network training.
DASO uses a hierarchical and asynchronous communication scheme comprising node-local and global networks.
We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks.
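A schematic of such a hierarchical scheme (the group sizes, averaging rule, and synchronization period are placeholders, not DASO's actual protocol): parameters are averaged frequently within each node's local group of GPUs and only occasionally across nodes over the slower global network.

```python
import numpy as np

def hierarchical_average(worker_params, node_of_worker, step, global_every=8):
    """Average parameters within each node every step; across nodes every `global_every` steps.

    worker_params: dict worker_id -> parameter vector (np.ndarray)
    node_of_worker: dict worker_id -> node_id
    """
    # node-local averaging over the fast intra-node interconnect
    nodes = {}
    for w, p in worker_params.items():
        nodes.setdefault(node_of_worker[w], []).append(p)
    node_avg = {n: np.mean(ps, axis=0) for n, ps in nodes.items()}

    if step % global_every == 0:
        # occasional global averaging over the slow inter-node network
        global_avg = np.mean(list(node_avg.values()), axis=0)
        node_avg = {n: global_avg for n in node_avg}

    return {w: node_avg[node_of_worker[w]] for w in worker_params}

# usage: 2 nodes x 4 workers with slightly diverged parameters
params = {w: np.full(3, float(w)) for w in range(8)}
nodes = {w: w // 4 for w in range(8)}
print(hierarchical_average(params, nodes, step=8)[0])   # global step: all workers agree
```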
arXiv Detail & Related papers (2021-04-12T16:02:20Z) - Weight Update Skipping: Reducing Training Time for Artificial Neural
Networks [0.30458514384586394]
We propose a new training methodology for ANNs that exploits the observation that the improvement in accuracy shows temporal variations.
During time windows in which accuracy shows little improvement, we skip weight updates but keep updating the biases, which ensures the network still trains and avoids overfitting.
Such a training approach achieves virtually the same accuracy with considerably less computational cost and thus lower training time.
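A minimal sketch of the idea (the plateau test and its threshold are assumptions, not the paper's exact schedule): when validation accuracy has stalled, weight updates are skipped and only bias updates are applied, cutting per-step compute while the network keeps training.

```python
def apply_updates(params, grads, lr, acc_history, window=3, min_gain=1e-3):
    """Skip weight updates during accuracy plateaus; always update biases.

    params/grads: dicts mapping names like 'fc.weight' / 'fc.bias' to lists of
    floats; acc_history: recent validation accuracies, newest last.
    """
    plateau = (
        len(acc_history) > window
        and acc_history[-1] - acc_history[-1 - window] < min_gain
    )
    for name, g in grads.items():
        if plateau and name.endswith("weight"):
            continue                            # weights frozen during the plateau window
        params[name] = [p - lr * gi for p, gi in zip(params[name], g)]
    return params

# usage: accuracy barely moved over the last 3 evaluations -> only the bias is updated
params = {"fc.weight": [0.5, -0.2], "fc.bias": [0.1]}
grads = {"fc.weight": [0.05, 0.01], "fc.bias": [0.02]}
acc = [0.70, 0.800, 0.8004, 0.8006, 0.8007]
print(apply_updates(params, grads, lr=0.1, acc_history=acc))
```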
arXiv Detail & Related papers (2020-12-05T15:12:10Z)