Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for
DNN Workloads
- URL: http://arxiv.org/abs/2007.04069v1
- Date: Wed, 8 Jul 2020 12:38:03 GMT
- Title: Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for
DNN Workloads
- Authors: Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, LanSong Diao, Guoping
Long, Jun Yang, Xiaoyong Liu, Wei Lin
- Abstract summary: Auto-MAP is a framework for exploring distributed execution plans for DNN workloads.
It automatically discovers fast parallelization strategies through reinforcement learning at the IR level of deep learning models.
Our evaluation shows that Auto-MAP can find the optimal solution within two hours while achieving better throughput on several NLP and convolution models.
- Score: 11.646744408920764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The last decade has witnessed growth in the computational requirements for
training deep neural networks. Current approaches (e.g., data/model
parallelism, pipeline parallelism) parallelize training tasks onto multiple
devices. However, these approaches always rely on specific deep learning
frameworks and require elaborate manual design, which makes it difficult to
maintain and share strategies between different types of models. In this paper,
we propose Auto-MAP, a framework for exploring distributed execution plans for
DNN workloads, which can automatically discover fast parallelization strategies
through reinforcement learning at the IR level of deep learning models.
Efficient exploration remains a major challenge for reinforcement learning. We
leverage DQN with task-specific pruning strategies to efficiently explore a
search space that includes optimized strategies. Our evaluation shows that
Auto-MAP can find the optimal solution within two hours while achieving better
throughput on several NLP and convolution models.
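As a rough illustration of the idea described in the abstract, the sketch below trains a small DQN to assign each layer of a model to one of a few partitioning choices, with task-specific pruning expressed as a mask over actions that would exceed a per-device memory budget and the reward given by the negative cost of the completed plan. The layer sizes, memory budget, cost model, state encoding, and network are illustrative assumptions, not details taken from the paper.

# Minimal sketch (not the paper's implementation): a DQN-style agent assigns each
# layer to one of several partitioning/placement choices; pruning is an action mask.
import random
import torch
import torch.nn as nn

NUM_LAYERS, NUM_ACTIONS = 8, 4                           # e.g. 4 candidate placements per layer
LAYER_MEM = [2.0, 4.0, 1.0, 3.0, 2.5, 1.5, 4.5, 2.0]     # toy per-layer memory (GB)
DEVICE_BUDGET = 10.0                                     # toy per-device memory budget (GB)

qnet = nn.Sequential(nn.Linear(NUM_LAYERS * NUM_ACTIONS, 64), nn.ReLU(),
                     nn.Linear(64, NUM_ACTIONS))
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)

def encode(assignment):
    # One-hot encode the partial plan: one slot per (layer, action) pair.
    x = torch.zeros(NUM_LAYERS * NUM_ACTIONS)
    for layer, act in enumerate(assignment):
        x[layer * NUM_ACTIONS + act] = 1.0
    return x

def prune_mask(assignment, layer):
    # Task-specific pruning: forbid placements that exceed a device's memory budget.
    mask = torch.zeros(NUM_ACTIONS, dtype=torch.bool)
    for act in range(NUM_ACTIONS):
        used = sum(LAYER_MEM[l] for l, a in enumerate(assignment) if a == act)
        mask[act] = used + LAYER_MEM[layer] <= DEVICE_BUDGET
    return mask

def plan_cost(assignment):
    # Toy cost: memory imbalance across devices plus a penalty per cross-device boundary.
    load = [sum(LAYER_MEM[l] for l, a in enumerate(assignment) if a == d) for d in range(NUM_ACTIONS)]
    comm = sum(1 for l in range(1, NUM_LAYERS) if assignment[l] != assignment[l - 1])
    return max(load) - min(load) + 0.5 * comm

eps = 1.0
for _ in range(500):
    assignment = []
    for layer in range(NUM_LAYERS):
        state = encode(assignment)
        mask = prune_mask(assignment, layer)
        if not mask.any():                # dead end after pruning: fall back to any action
            mask[:] = True
        q = qnet(state)
        q_masked = q.masked_fill(~mask, float("-inf"))
        if random.random() < eps:         # epsilon-greedy over the pruned action set
            act = random.choice(mask.nonzero().flatten().tolist())
        else:
            act = int(q_masked.argmax())
        assignment.append(act)

        # Reward only at episode end: negative cost of the full plan.
        done = layer == NUM_LAYERS - 1
        reward = -plan_cost(assignment) if done else 0.0
        with torch.no_grad():
            next_q = 0.0 if done else qnet(encode(assignment)).max()
        target = reward + 0.99 * next_q
        loss = (q[act] - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    eps = max(0.05, eps * 0.99)

print("example plan:", assignment, "cost:", plan_cost(assignment))

A real system would replace the toy cost model with throughput measured or estimated from the model's IR, and would use a replay buffer and target network for stable training; the sketch only shows how pruning narrows the action space the DQN has to explore.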
Related papers
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs.
We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z) - Improving Automatic Parallel Training via Balanced Memory Workload
Optimization [36.87527680184956]
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains.
We present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy.
Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints.
arXiv Detail & Related papers (2023-07-05T05:28:38Z) - DeepSeq: Deep Sequential Circuit Learning [10.402436619244911]
Circuit representation learning is a promising research direction in the electronic design automation (EDA) field.
Existing solutions only target combinational circuits, significantly limiting their applications.
We propose DeepSeq, a novel representation learning framework for sequential netlists.
arXiv Detail & Related papers (2023-02-27T09:17:35Z) - TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic
Parallelisation [19.009600866053923]
We present a model parallelism framework TAP that automatically searches for the best data and tensor parallel schedules.
Experiments show that TAP is $20\times$ to $160\times$ faster than the state-of-the-art automatic parallelism framework.
arXiv Detail & Related papers (2023-02-01T05:22:28Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - DistIR: An Intermediate Representation and Simulator for Efficient
Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z) - Elastic Architecture Search for Diverse Tasks with Different Resources [87.23061200971912]
We study a new challenging problem of efficient deployment for diverse tasks with different resources, where the resource constraint and task of interest corresponding to a group of classes are dynamically specified at testing time.
Previous NAS approaches seek to design architectures for all classes simultaneously, which may not be optimal for some individual tasks.
We present a novel and general framework, called Elastic Architecture Search (EAS), permitting instant specializations at runtime for diverse tasks with various resource constraints.
arXiv Detail & Related papers (2021-08-03T00:54:27Z) - AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z) - GAKP: GRU Association and Kalman Prediction for Multiple Object Tracking [8.559199703957393]
Multiple Object Tracking (MOT) has been a useful yet challenging task in many real-world applications such as video surveillance, intelligent retail, and smart city.
We propose a novel tracking method that integrates an auto-tuning Kalman method for prediction with a Gated Recurrent Unit (GRU) for association, and achieves near-optimal performance with a small amount of training data.
arXiv Detail & Related papers (2020-12-28T15:52:24Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and
Robust AutoDL [53.40030379661183]
Auto-PyTorch is a framework to enable fully automated deep learning (AutoDL).
It combines multi-fidelity optimization with portfolio construction for warmstarting and ensembling of deep neural networks (DNNs).
We show that Auto-PyTorch performs better than several state-of-the-art competitors on average.
arXiv Detail & Related papers (2020-06-24T15:15:17Z)