dPRO: A Generic Profiling and Optimization System for Expediting
Distributed DNN Training
- URL: http://arxiv.org/abs/2205.02473v1
- Date: Thu, 5 May 2022 07:15:25 GMT
- Title: dPRO: A Generic Profiling and Optimization System for Expediting
Distributed DNN Training
- Authors: Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo
Zhu, Haibin Lin, Chuanxiong Guo
- Abstract summary: This paper proposes dPRO, a tool to identify performance bottlenecks in distributed training systems.
We implement dPRO on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server architectures).
Extensive experiments show that dPRO predicts the performance of distributed training in various settings with <5% error in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.
- Score: 12.413533491501548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed training using multiple devices (i.e., GPU servers) has been
widely adopted for learning DNN models over large datasets. However, the
performance of large-scale distributed training tends to be far from linear
speed-up in practice. Given the complexity of distributed systems, it is
challenging to identify the root cause(s) of inefficiency and exercise
effective performance optimizations when unexpected low training speed occurs.
To date, there exists no software tool which diagnoses performance issues and
helps expedite distributed DNN training, while the training can be run using
different machine learning frameworks. This paper proposes dPRO, a toolkit that
includes: (1) an efficient profiler that collects runtime traces of distributed
DNN training across multiple frameworks, especially fine-grained communication
traces, and constructs global data flow graphs including detailed communication
operations for accurate replay; (2) an optimizer that effectively identifies
performance bottlenecks and explores optimization strategies (from computation,
communication and memory aspects) for training acceleration. We implement dPRO
on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and
representative communication schemes (AllReduce and Parameter Server
architecture). Extensive experiments show that dPRO predicts performance of
distributed training in various settings with <5% errors in most cases and finds
optimization strategies with up to 87.1% speed-up over the baselines.
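To illustrate the replay idea behind such a profiler-plus-simulator, the sketch below estimates an iteration's makespan from a global data flow graph of timed compute and communication ops. This is a minimal, hypothetical Python example, not dPRO's implementation: the op names, device labels, and the replay() helper are invented for illustration, and the scheduler is a simple earliest-start simulation with one serial queue per device.

```python
# Hypothetical sketch of trace replay over a global data flow graph, in the
# spirit of a dPRO-style simulator (not the authors' code). Each op carries a
# measured duration and a device; ops on the same device run serially, and
# dependency edges order ops across devices.
from collections import defaultdict

def replay(ops, deps):
    """ops: {name: (device, duration_ms)}; deps: {name: [predecessor names]}.
    Returns the estimated iteration makespan in ms under an earliest-start
    schedule with one execution queue per device."""
    finish = {}                       # op name -> finish time
    device_free = defaultdict(float)  # device  -> time it becomes idle
    remaining = dict(deps)
    while remaining:
        # schedule every op whose predecessors have all finished
        ready = [op for op, preds in remaining.items()
                 if all(p in finish for p in preds)]
        for op in ready:
            device, dur = ops[op]
            start = max([device_free[device]] +
                        [finish[p] for p in remaining[op]])
            finish[op] = start + dur
            device_free[device] = finish[op]
            del remaining[op]
    return max(finish.values())

# Toy two-GPU iteration: forward/backward compute followed by an AllReduce.
ops = {"fw0": ("gpu0:compute", 5.0), "bw0": ("gpu0:compute", 9.0),
       "fw1": ("gpu1:compute", 5.0), "bw1": ("gpu1:compute", 9.0),
       "allreduce": ("net", 6.0),
       "update0": ("gpu0:compute", 1.0), "update1": ("gpu1:compute", 1.0)}
deps = {"fw0": [], "fw1": [], "bw0": ["fw0"], "bw1": ["fw1"],
        "allreduce": ["bw0", "bw1"],
        "update0": ["allreduce"], "update1": ["allreduce"]}
print(f"estimated iteration time: {replay(ops, deps):.1f} ms")
```

In this toy graph the AllReduce depends on both backward passes, so the simulated makespan shows how a slow communication op delays the parameter updates on every worker, which is the kind of bottleneck a replay-based optimizer would surface.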
Related papers
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z) - RESPECT: Reinforcement Learning based Edge Scheduling on Pipelined Coral
Edge TPUs [12.952987240366781]
This work presents a reinforcement learning (RL) based scheduling framework, which learns the behaviors of optimal optimization algorithms.
RL generates near-optimal scheduling results with short solving runtime overhead.
Our framework has demonstrated up to $\sim 2.5\times$ real-world on-chip runtime inference speedups over the commercial compiler.
arXiv Detail & Related papers (2023-04-10T17:22:12Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Distributed Adversarial Training to Robustify Deep Neural Networks at
Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z) - DistIR: An Intermediate Representation and Simulator for Efficient
Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z) - Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware
Multifaceted Optimizations [15.659251804042748]
Woodpecker-DL (WPK) is a hardware-aware deep learning framework.
WPK uses graph optimization, automated searches, a domain-specific language (DSL) and system-level exploration to accelerate inference.
We show that on a P100 GPU, we can achieve speedups of 5.40x over cuDNN and 1.63x over TVM on individual operators, and run up to 1.18 times faster than TensorRT for end-to-end model inference.
arXiv Detail & Related papers (2020-08-11T07:50:34Z) - HyperTune: Dynamic Hyperparameter Tuning For Efficient Distribution of
DNN Training Over Heterogeneous Systems [1.4680035572775532]
This paper describes distributed training of Deep Neural Networks (DNN) on computational storage devices (CSD).
A CSD-based distributed architecture incorporates the advantages of federated learning in terms of performance scalability, resiliency, and data privacy.
The paper also describes Stannis, a DNN training framework that improves on the shortcomings of existing distributed training frameworks.
arXiv Detail & Related papers (2020-07-16T02:12:44Z) - Automated Design Space Exploration for optimised Deployment of DNN on
Arm Cortex-A CPUs [13.628734116014819]
Deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNN).
There is a lack of research on cross-level optimisation as the space of approaches becomes too large to test and obtain a globally optimised solution.
We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory.
arXiv Detail & Related papers (2020-06-09T11:00:06Z) - Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)