MAP: Memory-aware Automated Intra-op Parallel Training For Foundation
Models
- URL: http://arxiv.org/abs/2302.02599v1
- Date: Mon, 6 Feb 2023 07:22:49 GMT
- Title: MAP: Memory-aware Automated Intra-op Parallel Training For Foundation
Models
- Authors: Yuliang Liu, Shenggui Li, Jiarui Fang, Yanjun Shao, Boyuan Yao, Yang
You
- Abstract summary: We introduce MAP, a compiler built upon PyTorch to implement Memory-aware Automated Parallelization.
Compared with existing methods, MAP provides an easy-to-use symbolic profiler to generate memory and computing statistics of an arbitrary PyTorch model.
- Score: 15.256207550970501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large models have achieved the state of the art performances in
various fields. In order to support large model training, we have to use
distributed training techniques. However, finding an efficient distributed
execution plan not only requires fine-grained model statistics, such as memory
and computing overhead of each operator but also is a labor-intensive task even
for an expert in the field of distributed training. In this paper, we introduce
MAP, a compiler built upon PyTorch to implement Memory-aware Automated
Parallelization. To profiling operator costs, existing training systems and
machine learning pipelines either physically execute with respect to each
operand or estimate the memory usage with a scaled input tensor, which are
often time-consuming and misleading. Compared with existing methods, MAP
provides an easy-to-use symbolic profiler to generate memory and computing
statistics of an arbitrary PyTorch model with trivial time cost, so it will
boost high productivity for ML developers. In addition, MAP can also seamlessly
speed up different static planning tasks on computation graphs for PyTorch, and
requires only a few lines of modification to user code to generate a new module
instance that has a top-performing distributed execution plan. The source code
is publicly available at https://github.com/hpcaitech/ColossalAI
Related papers
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models [0.0]
We present a Linear Clustering approach to exploit inherent parallel paths in ML dataflow graphs.
We generate readable and executable parallel Pytorch+Python code from input ML models in ONNX format.
Preliminary results on several ML graphs demonstrate up to 1.9$times$ speedup over serial execution.
arXiv Detail & Related papers (2023-08-22T04:54:30Z) - Performance and Energy Consumption of Parallel Machine Learning
Algorithms [0.0]
Machine learning models have achieved remarkable success in various real-world applications.
Model training in machine learning requires large-scale data sets and multiple iterations before it can work properly.
Parallelization of training algorithms is a common strategy to speed up the process of training.
arXiv Detail & Related papers (2023-05-01T13:04:39Z) - TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with Randomly-trained After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differenti models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimize NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - Merak: An Efficient Distributed DNN Training Framework with Automated 3D
Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model.
Merak can speedup the training performance over the state-of-the-art 3D parallelism frameworks of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
arXiv Detail & Related papers (2022-06-10T09:15:48Z) - Walle: An End-to-End, General-Purpose, and Large-Scale Production System
for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML)
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z) - Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed
Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z) - Towards Efficient Post-training Quantization of Pre-trained Language
Models [85.68317334241287]
We study post-training quantization(PTQ) of PLMs, and propose module-wise quantization error minimization(MREM), an efficient solution to mitigate these issues.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100times$ and $20times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - Automatic Cross-Replica Sharding of Weight Update in Data-Parallel
Training [12.36664837965624]
This paper presents an approach to automatically shard the weight update across replicas.
We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs.
arXiv Detail & Related papers (2020-04-28T07:13:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.