Auto-Parallelizing Large Models with Rhino: A Systematic Approach on
Production AI Platform
- URL: http://arxiv.org/abs/2302.08141v1
- Date: Thu, 16 Feb 2023 08:19:56 GMT
- Title: Auto-Parallelizing Large Models with Rhino: A Systematic Approach on
Production AI Platform
- Authors: Shiwei Zhang, Lansong Diao, Siyu Wang, Zongyan Cao, Yiliang Gu, Chang
Si, Ziji Shi, Zhen Zheng, Chuan Wu, Wei Lin
- Abstract summary: Rhino is a system for accelerating tensor programs with automatic parallelization on an AI platform for real production environments.
It transforms a tensor program written for a single device into an equivalent distributed program that scales to thousands of devices with no user configuration.
- Score: 15.606647290942563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Rhino, a system for accelerating tensor programs with automatic
parallelization on an AI platform for real production environments. It transforms a
tensor program written for a single device into an equivalent distributed program
that scales to thousands of devices with no user configuration. Rhino first operates
on a semantically independent intermediate representation of tensor programs, which
helps it generalize to previously unseen applications. In addition, it implements a
task-oriented controller and a distributed runtime for optimal performance. Rhino
explores a complete and systematic parallelization strategy space that comprises all
the paradigms commonly employed in deep learning (DL), as well as strided
partitioning and pipeline parallelism on non-linear models. To search efficiently
for a near-optimal parallel execution plan, we analyze production clusters and
derive general heuristics that speed up the strategy search. On top of these
heuristics, two optimization levels give users a flexible trade-off between search
time and strategy quality. Our experiments demonstrate that Rhino can not only
rediscover the expert-crafted strategies of classic, research, and production DL
models, but also identify novel parallelization strategies that surpass existing
systems on new models.
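As a concrete illustration of the single-device-to-distributed transformation described in the abstract, the following minimal Python sketch shards a matmul's weight across a simulated device mesh and checks the result against the single-device program. It is a toy under stated assumptions (NumPy arrays, a fake device list standing in for a mesh), not Rhino's IR or API.

```python
# Illustrative only: a toy "single-device" matmul and a hand-sharded
# equivalent, simulating the kind of transformation the abstract describes.
# The device mesh is faked with a Python list; nothing here is Rhino's API.
import numpy as np

def single_device_program(x, w):
    # Reference tensor program written for one device.
    return x @ w

def distributed_program(x, w, num_devices=4):
    # Strided (column-wise) partitioning of the weight: each "device"
    # computes a slice of the output, and the concatenation at the end
    # stands in for an all-gather collective.
    w_shards = np.split(w, num_devices, axis=1)       # shard the weight
    partials = [x @ w_shard for w_shard in w_shards]  # per-device compute
    return np.concatenate(partials, axis=1)           # gather the output

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w = rng.standard_normal((16, 32))
assert np.allclose(single_device_program(x, w), distributed_program(x, w))
```

The same equivalence check applies to the other paradigms in the strategy space; only the sharded axes and the collective used to reassemble the result change.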
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
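The ParallelSpec summary above contrasts a parallel drafter with auto-regressive drafting; the toy loop below sketches the generic draft-then-verify structure of speculative decoding under that reading. Both "models" are random stand-ins and the accept rule is invented for illustration, not the paper's method.

```python
# Hypothetical sketch of a speculative decoding loop with a parallel drafter:
# the drafter proposes k tokens at once, the target model verifies them, and
# accepted tokens are appended. All components here are toy placeholders.
import random

def parallel_drafter(prefix, k):
    # A parallel drafter emits k draft tokens in one pass
    # (an auto-regressive drafter would need k sequential passes).
    return [random.randrange(100) for _ in range(k)]

def target_model_accepts(prefix, token):
    # Toy accept/reject rule standing in for verification under the
    # target model's distribution.
    return random.random() < 0.7

def speculative_decode(prompt, steps=5, k=4):
    tokens = list(prompt)
    for _ in range(steps):
        draft = parallel_drafter(tokens, k)
        for t in draft:                       # verify drafts left to right
            if not target_model_accepts(tokens, t):
                break                         # first rejection stops the block
            tokens.append(t)
        tokens.append(random.randrange(100))  # target model's own next token
    return tokens

print(speculative_decode([1, 2, 3]))
```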
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
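ViR's "dual parallel and recurrent formulations" refer to a retention-style mechanism that can be computed either as one masked matrix product (good for parallel training) or as a per-token state update (good for fast inference). The sketch below shows the generic retention equivalence in the style of RetNet, with normalization and rotations omitted; ViR's exact formulation may differ.

```python
# A minimal sketch of a retention mechanism's two equivalent forms
# (parallel for training, recurrent for inference). Illustrative only.
import numpy as np

def retention_parallel(Q, K, V, gamma):
    n = Q.shape[0]
    idx = np.arange(n)
    D = np.tril(gamma ** (idx[:, None] - idx[None, :]))  # causal decay mask
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    outputs = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)   # O(1) state update per token
        outputs.append(q @ S)
    return np.stack(outputs)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
assert np.allclose(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
```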
- Improving Automatic Parallel Training via Balanced Memory Workload Optimization [36.87527680184956]
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains.
We present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy.
Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints.
arXiv Detail & Related papers (2023-07-05T05:28:38Z)
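To show the kind of decision Galvatron-BMW automates, the toy search below enumerates (data, tensor, pipeline) parallel degrees for a fixed GPU count, discards configurations that exceed a memory budget, and picks the cheapest under an invented cost model. The constants and formulas are placeholders, not the paper's memory or workload models.

```python
# A toy, hypothetical search over hybrid parallelism degrees under a GPU
# memory budget. Cost and memory models are made up for illustration.
from itertools import product

NUM_GPUS = 16
MODEL_PARAMS_GB = 40.0
GPU_MEMORY_GB = 24.0

def memory_per_gpu(dp, tp, pp):
    # Parameters are split by tensor and pipeline parallelism; data
    # parallelism replicates them (activations ignored for brevity).
    return MODEL_PARAMS_GB / (tp * pp) * 3.0  # weights + grads + optimizer

def step_cost(dp, tp, pp):
    # Invented cost: compute shrinks with more devices, communication
    # grows with tensor-parallel width and pipeline depth.
    return 100.0 / (dp * tp * pp) + 2.0 * tp + 1.5 * pp

candidates = [
    (dp, tp, pp)
    for dp, tp, pp in product([1, 2, 4, 8, 16], repeat=3)
    if dp * tp * pp == NUM_GPUS and memory_per_gpu(dp, tp, pp) <= GPU_MEMORY_GB
]
best = min(candidates, key=lambda c: step_cost(*c))
print("best (dp, tp, pp):", best)
```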
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications [0.8889304968879161]
We run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms.
We show how desynchronization patterns can be readily identified from a data set that is much smaller than a full MPI trace.
arXiv Detail & Related papers (2022-05-27T13:19:07Z)
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z)
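Alpa's plan space is hierarchical: an inter-operator choice (cutting the model into pipeline stages) combined with an intra-operator choice (how each stage is sharded). The toy enumeration below mimics that two-level structure with invented stage costs and sharding penalties; it is not Alpa's actual optimization formulation.

```python
# A toy illustration of a two-level plan space: inter-operator stage cuts
# combined with an intra-operator sharding choice. Costs are placeholders.
from itertools import combinations

LAYERS = list(range(8))          # 8 layers to place into pipeline stages
INTRA_OP_PLANS = ["replicate", "shard_weights", "shard_activations"]

def stage_cuts(layers, num_stages):
    # All ways to cut the layer list into contiguous pipeline stages.
    for cut in combinations(range(1, len(layers)), num_stages - 1):
        bounds = [0, *cut, len(layers)]
        yield [layers[a:b] for a, b in zip(bounds, bounds[1:])]

def plan_cost(stages, intra_plan):
    # Invented cost: stage imbalance plus a per-plan sharding penalty.
    sizes = [len(s) for s in stages]
    penalty = {"replicate": 3.0, "shard_weights": 1.0, "shard_activations": 2.0}
    return (max(sizes) - min(sizes)) + penalty[intra_plan]

best = min(
    ((stages, intra) for stages in stage_cuts(LAYERS, 4) for intra in INTRA_OP_PLANS),
    key=lambda p: plan_cost(*p),
)
print("stages:", best[0], "intra-op plan:", best[1])
```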
- Automap: Towards Ergonomic Automated Parallelism for ML Models [2.469997094590327]
We present the prototype of an automated partitioner that seamlessly integrates into existing compilers and existing user workflows.
Our partitioner enables SPMD-style parallelism that encompasses data parallelism and parameter/activation sharding.
Through a combination of inductive tactics and search in a platform-independent partitioning IR, automap can recover expert partitioning strategies such as Megatron sharding for transformer layers.
arXiv Detail & Related papers (2021-12-06T12:09:38Z)
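The Megatron sharding that automap is said to recover splits a transformer MLP's first linear layer by columns and its second by rows, so each device computes a partial output that is then all-reduced. The NumPy sketch below checks that this sharding reproduces the unsharded MLP; it illustrates the strategy itself, not automap's partitioning IR or output format.

```python
# A minimal NumPy sketch of Megatron-style MLP sharding: column-parallel
# first GEMM, row-parallel second GEMM, and a sum standing in for all-reduce.
import numpy as np

def mlp_reference(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2          # Linear -> ReLU -> Linear

def mlp_megatron_sharded(x, w1, w2, num_devices=2):
    w1_cols = np.split(w1, num_devices, axis=1)  # column-parallel first GEMM
    w2_rows = np.split(w2, num_devices, axis=0)  # row-parallel second GEMM
    partials = [
        np.maximum(x @ w1_c, 0.0) @ w2_r         # each device computes locally
        for w1_c, w2_r in zip(w1_cols, w2_rows)
    ]
    return sum(partials)                         # stands in for all-reduce

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))
assert np.allclose(mlp_reference(x, w1, w2), mlp_megatron_sharded(x, w1, w2))
```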
- DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
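TeraPipe's key observation is that causal Transformer computation lets a single long sequence be cut into token chunks and pipelined across layer stages, much like micro-batch pipelining but within one sequence. The toy schedule below prints which stage works on which chunk at each step; chunk counts and timing are illustrative only, and the chunk-size optimization described in the paper is ignored.

```python
# A toy pipeline schedule over token chunks of one sequence: at time t,
# stage s works on chunk (t - s), the classic pipeline diagonal.
NUM_STAGES = 4   # groups of layers, one per device
NUM_CHUNKS = 6   # token chunks within a single long sequence

def token_pipeline_schedule(num_stages, num_chunks):
    schedule = []
    for t in range(num_stages + num_chunks - 1):
        busy = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_chunks]
        schedule.append(busy)
    return schedule

for t, busy in enumerate(token_pipeline_schedule(NUM_STAGES, NUM_CHUNKS)):
    print(f"t={t}: " + ", ".join(f"stage{s}<-chunk{c}" for s, c in busy))
```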
- Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach [16.702537371391053]
This article presents an automatic approach to determining hardware resource partitioning and task granularity for task-based parallel applications on heterogeneous many-core architectures.
Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration.
Compared to the single-stream version, our approach achieves a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively.
arXiv Detail & Related papers (2020-03-05T21:18:21Z)