Automap: Towards Ergonomic Automated Parallelism for ML Models
- URL: http://arxiv.org/abs/2112.02958v1
- Date: Mon, 6 Dec 2021 12:09:38 GMT
- Title: Automap: Towards Ergonomic Automated Parallelism for ML Models
- Authors: Michael Schaarschmidt and Dominik Grewe and Dimitrios Vytiniotis and
Adam Paszke and Georg Stefan Schmid and Tamara Norman and James Molloy and
Jonathan Godwin and Norman Alexander Rink and Vinod Nair and Dan Belov
- Abstract summary: We present the prototype of an automated partitioner that seamlessly integrates into existing compilers and existing user workflows.
Our partitioner enables SPMD-style parallelism that encompasses data parallelism and parameter/activation sharding.
Through a combination of inductive tactics and search in a platform-independent partitioning IR, automap can recover expert partitioning strategies such as Megatron sharding for transformer layers.
- Score: 2.469997094590327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid rise in demand for training large neural network architectures has
brought into focus the need for partitioning strategies, for example by using
data, model, or pipeline parallelism. Implementing these methods is
increasingly supported through program primitives, but identifying efficient
partitioning strategies requires expensive experimentation and expertise. We
present the prototype of an automated partitioner that seamlessly integrates
into existing compilers and existing user workflows. Our partitioner enables
SPMD-style parallelism that encompasses data parallelism and
parameter/activation sharding. Through a combination of inductive tactics and
search in a platform-independent partitioning IR, automap can recover expert
partitioning strategies such as Megatron sharding for transformer layers.
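To make the target concrete, the sketch below hand-writes one such expert strategy using public JAX sharding primitives: data parallelism over the batch composed with Megatron-style sharding of a transformer MLP block. This is not automap's API; the mesh axis names and layer sizes are illustrative assumptions, and automap's goal is to recover shardings like these automatically rather than requiring them to be written by hand.

```python
# Minimal hand-written sketch (assumed mesh axes "data"/"model", toy sizes),
# NOT automap itself: the kind of SPMD strategy automap aims to recover.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 2D device mesh: "data" axis for data parallelism, "model" axis for
# Megatron-style sharding of the transformer MLP weights.
mesh = Mesh(mesh_utils.create_device_mesh((1, jax.device_count())),
            axis_names=("data", "model"))

d_model, d_ff = 512, 2048
w_in = jnp.zeros((d_model, d_ff))    # first MLP projection
w_out = jnp.zeros((d_ff, d_model))   # second MLP projection
x = jnp.zeros((8, 128, d_model))     # (batch, sequence, features)

# Megatron sharding: w_in column-sharded and w_out row-sharded over "model";
# activations sharded over "data" along the batch dimension.
w_in = jax.device_put(w_in, NamedSharding(mesh, P(None, "model")))
w_out = jax.device_put(w_out, NamedSharding(mesh, P("model", None)))
x = jax.device_put(x, NamedSharding(mesh, P("data", None, None)))

@jax.jit
def mlp(x, w_in, w_out):
    # XLA's SPMD partitioner propagates the input shardings and inserts the
    # collective needed after the row-sharded second matmul.
    return jax.nn.relu(x @ w_in) @ w_out

y = mlp(x, w_in, w_out)
```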
Related papers
- PartIR: Composing SPMD Partitioning Strategies for Machine Learning [1.145010277058103]
We present PartIR, our design for a NN partitioning system.
PartIR is focused on an incremental approach to rewriting and is hardware-and-runtime agnostic.
We evaluate PartIR on several different models to demonstrate its predictability, expressibility, and ability to reach peak performance.
arXiv Detail & Related papers (2024-01-20T10:30:31Z)
- Improving Automatic Parallel Training via Balanced Memory Workload Optimization [36.87527680184956]
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains.
We present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy.
Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints.
arXiv Detail & Related papers (2023-07-05T05:28:38Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform [15.606647290942563]
Rhino is a system for accelerating tensor programs with automatic parallelization on an AI platform for real production environments.
It transforms a tensor program written for a single device into an equivalent distributed program that is capable of scaling up to thousands of devices with no user configuration.
arXiv Detail & Related papers (2023-02-16T08:19:56Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Automatic Discovery of Composite SPMD Partitioning Strategies in PartIR [1.2507285499419876]
We present an automatic partitioner that identifies efficient combinations for many model architectures and accelerator systems.
Our key finding is that a Monte Carlo Tree Search-based partitioner, which incorporates partition-specific compiler analyses directly into the search, matches expert-level strategies for various models.
arXiv Detail & Related papers (2022-10-07T17:46:46Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script-language engines alone do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using such basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution [15.086401550425125]
DistIR is a representation for distributed computation that is tailored for efficient analyses.
We show how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations.
arXiv Detail & Related papers (2021-11-09T21:32:51Z)
- DHA: End-to-End Joint Optimization of Data Augmentation Policy, Hyper-parameter and Architecture [81.82173855071312]
We propose an end-to-end solution that integrates the AutoML components and returns a ready-to-use model at the end of the search.
DHA achieves state-of-the-art (SOTA) results on various datasets, in particular 77.4% accuracy on ImageNet with a cell-based search space.
arXiv Detail & Related papers (2021-09-13T08:12:50Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime (a minimal sketch of the idea follows this list).
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
- Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation [144.50154657257605]
We propose an efficient framework to simultaneously search for all main components including backbone, segmentation branches, and feature fusion module.
Our searched architecture, namely Auto-Panoptic, achieves the new state-of-the-art on the challenging COCO and ADE20K benchmarks.
arXiv Detail & Related papers (2020-10-30T08:34:35Z)
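For the "Parallel Training of Deep Networks with Local Updates" entry above, the following is a minimal, hypothetical sketch of local parallelism with truncated layer-wise backpropagation: each block is trained against its own auxiliary loss (assumed here to be a small local classifier head), and stop_gradient cuts the backward pass between blocks so they could be updated in parallel. The network sizes and loss are illustrative assumptions, not the paper's setup.

```python
# Toy local-parallelism sketch: per-block auxiliary losses, truncated backprop.
import jax
import jax.numpy as jnp

def init(key, sizes=(32, 64, 64, 10)):
    keys = jax.random.split(key, len(sizes) - 1)
    return [
        {"w": jax.random.normal(k, (fan_in, fan_out)) * 0.1,
         "head": jax.random.normal(k, (fan_out, 10)) * 0.1}  # local classifier head
        for k, fan_in, fan_out in zip(keys, sizes[:-1], sizes[1:])
    ]

def local_losses(params, x, labels):
    # Sum of per-block losses; gradients do not cross block boundaries.
    losses = []
    h = x
    for layer in params:
        h = jax.nn.relu(h @ layer["w"])
        logits = h @ layer["head"]                  # local prediction for this block
        log_probs = jax.nn.log_softmax(logits)
        losses.append(-jnp.mean(log_probs[jnp.arange(labels.size), labels]))
        h = jax.lax.stop_gradient(h)                # truncate backprop between blocks
    return jnp.sum(jnp.stack(losses))

params = init(jax.random.PRNGKey(0))
x = jnp.ones((8, 32))
labels = jnp.zeros((8,), dtype=jnp.int32)
grads = jax.grad(local_losses)(params, x, labels)   # each block sees only its local loss
```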