Related papers: MAP: Memory-aware Automated Intra-op Parallel Training For Foundation Models

MAP: Memory-aware Automated Intra-op Parallel Training For Foundation Models

URL: http://arxiv.org/abs/2302.02599v1
Date: Mon, 6 Feb 2023 07:22:49 GMT
Title: MAP: Memory-aware Automated Intra-op Parallel Training For Foundation Models
Authors: Yuliang Liu, Shenggui Li, Jiarui Fang, Yanjun Shao, Boyuan Yao, Yang You
Abstract summary: We introduce MAP, a compiler built upon PyTorch to implement Memory-aware Automated Parallelization. Compared with existing methods, MAP provides an easy-to-use symbolic profiler to generate memory and computing statistics of an arbitrary PyTorch model.
Score: 15.256207550970501
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, large models have achieved the state of the art performances in various fields. In order to support large model training, we have to use distributed training techniques. However, finding an efficient distributed execution plan not only requires fine-grained model statistics, such as memory and computing overhead of each operator but also is a labor-intensive task even for an expert in the field of distributed training. In this paper, we introduce MAP, a compiler built upon PyTorch to implement Memory-aware Automated Parallelization. To profiling operator costs, existing training systems and machine learning pipelines either physically execute with respect to each operand or estimate the memory usage with a scaled input tensor, which are often time-consuming and misleading. Compared with existing methods, MAP provides an easy-to-use symbolic profiler to generate memory and computing statistics of an arbitrary PyTorch model with trivial time cost, so it will boost high productivity for ML developers. In addition, MAP can also seamlessly speed up different static planning tasks on computation graphs for PyTorch, and requires only a few lines of modification to user code to generate a new module instance that has a top-performing distributed execution plan. The source code is publicly available at https://github.com/hpcaitech/ColossalAI

Related papers

An Efficient Training Algorithm for Models with Block-wise Sparsity [6.882042556551613]
We propose an efficient training algorithm to decrease both computation and memory costs during training and inference. Our algorithms can decrease the computation and memory costs significantly without a performance drop compared to baselines.
arXiv Detail & Related papers (2025-03-27T19:14:27Z)
AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single- GPU and multi- GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
Automatic Task Parallelization of Dataflow Graphs in ML/DL models [0.0]
We present a Linear Clustering approach to exploit inherent parallel paths in ML dataflow graphs. We generate readable and executable parallel Pytorch+Python code from input ML models in ONNX format. Preliminary results on several ML graphs demonstrate up to 1.9$times$ speedup over serial execution.
arXiv Detail & Related papers (2023-08-22T04:54:30Z)
Performance and Energy Consumption of Parallel Machine Learning Algorithms [0.0]
Machine learning models have achieved remarkable success in various real-world applications. Model training in machine learning requires large-scale data sets and multiple iterations before it can work properly. Parallelization of training algorithms is a common strategy to speed up the process of training.
arXiv Detail & Related papers (2023-05-01T13:04:39Z)
TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with Randomly-trained After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differenti models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices. We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimize NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS) We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model. Merak can speedup the training performance over the state-of-the-art 3D parallelism frameworks of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
arXiv Detail & Related papers (2022-06-10T09:15:48Z)
Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML) Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment. We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z)
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models. Alpa generates execution plans that unify data, operator, and pipeline parallelism. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z)
Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization(PTQ) of PLMs, and propose module-wise quantization error minimization(MREM), an efficient solution to mitigate these issues. Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT reduces communication by at least $100times$ and $20times$ during DP and MP, respectively. It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training [12.36664837965624]
This paper presents an approach to automatically shard the weight update across replicas. We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs.
arXiv Detail & Related papers (2020-04-28T07:13:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.