Improving Automatic Parallel Training via Balanced Memory Workload
Optimization
- URL: http://arxiv.org/abs/2307.02031v2
- Date: Sat, 24 Feb 2024 08:12:53 GMT
- Title: Improving Automatic Parallel Training via Balanced Memory Workload
Optimization
- Authors: Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu,
Xiaonan Nie, Yaofeng Tu, Bin Cui
- Abstract summary: Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains.
We present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy.
Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints.
- Score: 36.87527680184956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have emerged as the leading approach for achieving
state-of-the-art performance across various application domains, serving as the
foundation for advanced large-scale deep learning (DL) models. However,
efficiently training these models across multiple GPUs remains a complex
challenge due to the abundance of parallelism options. Existing DL systems
either require manual efforts to design distributed training plans or limit
parallelism combinations to a constrained search space. In this paper, we
present Galvatron-BMW, a novel system framework that integrates multiple
prevalent parallelism dimensions and automatically identifies the most
efficient hybrid parallelism strategy. To effectively navigate this vast search
space, we employ a decision tree approach for decomposition and pruning based
on intuitive insights. We further utilize a dynamic programming search
algorithm to derive the optimal plan. Moreover, to improve resource utilization
and enhance system efficiency, we propose a bi-objective optimization workflow
that focuses on workload balance. Our evaluations on different Transformer
models demonstrate the capabilities of Galvatron-BMW in automating distributed
training under varying GPU memory constraints. Across all tested scenarios,
Galvatron-BMW consistently achieves superior system throughput, surpassing
previous approaches that rely on limited parallelism strategies.
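To make the search procedure described in the abstract concrete, the following is a minimal sketch of a per-layer dynamic programming search over hybrid parallelism strategies under a per-GPU memory budget. The strategy labels, the cost and memory numbers, and the discretization granularity are illustrative assumptions, not Galvatron-BMW's actual estimators or implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    name: str          # hypothetical labels, e.g. "dp8", "tp4xdp2", "pp2xtp2xdp2"
    time_cost: float   # estimated per-layer execution time (ms)
    memory: float      # estimated per-GPU memory footprint (GB)

def search_plan(num_layers, strategies, memory_budget_gb, gran_gb=0.25):
    """Choose one strategy per layer, minimizing total time under a per-GPU
    memory cap. dp[m] holds the best total time using m memory units; memory
    is discretized with granularity `gran_gb` to keep the table finite."""
    budget = int(memory_budget_gb / gran_gb)
    INF = float("inf")
    dp = [0.0] + [INF] * budget
    choice = [[None] * (budget + 1) for _ in range(num_layers)]

    for i in range(num_layers):
        new_dp = [INF] * (budget + 1)
        for m in range(budget + 1):
            if dp[m] == INF:
                continue
            for s in strategies:
                m2 = m + int(round(s.memory / gran_gb))
                if m2 > budget:
                    continue
                if dp[m] + s.time_cost < new_dp[m2]:
                    new_dp[m2] = dp[m] + s.time_cost
                    choice[i][m2] = (m, s)   # remember predecessor state
        dp = new_dp

    best_m = min(range(budget + 1), key=lambda m: dp[m])
    if dp[best_m] == INF:
        return None  # no feasible plan under this memory budget
    plan, m = [], best_m
    for i in reversed(range(num_layers)):    # backtrack the chosen strategies
        m, s = choice[i][m]
        plan.append(s.name)
    return list(reversed(plan)), dp[best_m]

# Example with made-up numbers: 24 identical layers, 16 GB per GPU.
# strategies = [Strategy("dp8", 3.0, 1.2), Strategy("tp4xdp2", 2.4, 0.5),
#               Strategy("pp2xtp2xdp2", 2.7, 0.4)]
# print(search_plan(24, strategies, 16.0))
```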
Related papers
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM accommodates a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can improve training efficiency by up to 20x compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
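The model-swapping idea summarized above can be illustrated with a short, generic sketch: stream one block of the model onto the GPU at a time so that a model larger than device memory still runs on a single host. This is an assumption-laden toy (forward pass only, no overlap of transfer and compute, no peer coordination), not ATOM's actual scheduler.

```python
import torch
import torch.nn as nn

# Forward-only sketch: a full training loop would also have to swap blocks
# back in during the backward pass.
@torch.no_grad()
def swapped_forward(blocks, x, device="cuda"):
    """Run `blocks` sequentially while keeping only one block on `device`."""
    x = x.to(device)
    for block in blocks:
        block.to(device)   # swap the block in
        x = block(x)
        block.to("cpu")    # swap it back out to free device memory
    return x

# Toy model: a stack of MLP blocks that would not all fit on the GPU at once.
blocks = nn.ModuleList(nn.Sequential(nn.Linear(1024, 1024), nn.GELU())
                       for _ in range(8))
# out = swapped_forward(blocks, torch.randn(4, 1024))
```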
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances using generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism [25.928940638269534]
We propose Galvatron, a framework that automatically finds the most efficient hybrid parallelism strategy.
Galvatron always achieves superior system throughput compared to previous work with limited parallelism.
arXiv Detail & Related papers (2022-11-25T03:45:31Z)
- Online Weighted Q-Ensembles for Reduced Hyperparameter Tuning in Reinforcement Learning [0.38073142980732994]
Reinforcement learning is a promising paradigm for learning robot control, allowing complex control policies to be learned without requiring a dynamics model.
We propose employing an ensemble of multiple reinforcement learning agents, each with a different set of hyperparameters, along with a mechanism for choosing the best-performing set.
The online weighted Q-Ensemble presented lower overall variance and superior results when compared with Q-average ensembles.
arXiv Detail & Related papers (2022-09-29T19:57:43Z)
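As a rough illustration of the weighted-ensemble mechanism described above, the sketch below combines several Q-functions with weights adapted online from their recent TD errors. The softmax weighting rule and the moving-average error tracking are assumptions made for illustration; the paper's exact update differs.

```python
import numpy as np

class WeightedQEnsemble:
    """Ensemble of Q-functions (e.g. trained with different hyperparameters)
    whose votes are weighted by how accurate each member has been recently."""

    def __init__(self, q_functions, temperature=1.0):
        self.q_functions = q_functions              # callables: state -> Q-values per action
        self.errors = np.zeros(len(q_functions))    # running TD-error estimates
        self.temperature = temperature

    def weights(self):
        # Lower recent error -> higher weight (softmax over negative error).
        logits = -self.errors / self.temperature
        w = np.exp(logits - logits.max())
        return w / w.sum()

    def act(self, state):
        # Weighted combination of the members' Q-values decides the action.
        w = self.weights()
        q = sum(wi * qf(state) for wi, qf in zip(w, self.q_functions))
        return int(np.argmax(q))

    def update_errors(self, td_errors, beta=0.1):
        # Exponential moving average of each member's absolute TD error.
        self.errors = (1 - beta) * self.errors + beta * np.abs(td_errors)
```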
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
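To show what combining these parallelism dimensions means in practice, here is a small, generic illustration (not Colossal-AI's API) of how a pool of GPUs can be partitioned into data-, pipeline-, and tensor-parallel groups; the (dp, pp, tp) rank layout is an assumed convention.

```python
from itertools import product

def parallel_groups(world_size, dp, pp, tp):
    """Lay ranks out on a (dp, pp, tp) grid; each rank joins one group per axis."""
    assert dp * pp * tp == world_size, "degrees must multiply to the world size"
    rank_of = {coord: rank for rank, coord in
               enumerate(product(range(dp), range(pp), range(tp)))}
    tp_groups = [[rank_of[(d, p, t)] for t in range(tp)]
                 for d in range(dp) for p in range(pp)]
    pp_groups = [[rank_of[(d, p, t)] for p in range(pp)]
                 for d in range(dp) for t in range(tp)]
    dp_groups = [[rank_of[(d, p, t)] for d in range(dp)]
                 for p in range(pp) for t in range(tp)]
    return tp_groups, pp_groups, dp_groups

# Example: 8 GPUs split as dp=2, pp=2, tp=2.
# tp_groups == [[0, 1], [2, 3], [4, 5], [6, 7]]
# pp_groups == [[0, 2], [1, 3], [4, 6], [5, 7]]
# dp_groups == [[0, 4], [1, 5], [2, 6], [3, 7]]
```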
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
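The token-level pipelining idea can be sketched as a simple wavefront schedule: each pipeline stage may process token chunk c once it has finished chunk c-1 and the previous stage has finished chunk c. The unit step times and uniform chunking below are assumptions; TeraPipe additionally optimizes the chunk sizes.

```python
def token_pipeline_schedule(num_stages, num_chunks):
    """Return (start_step, stage, chunk) triples for a naive wavefront schedule
    in which every (stage, chunk) unit of work takes one time step."""
    schedule, finish = [], {}
    for chunk in range(num_chunks):
        for stage in range(num_stages):
            ready = max(finish.get((stage, chunk - 1), 0),   # same stage, previous chunk
                        finish.get((stage - 1, chunk), 0))   # previous stage, same chunk
            finish[(stage, chunk)] = ready + 1
            schedule.append((ready, stage, chunk))
    return schedule

# With 4 stages and 8 token chunks the pipeline fills after a few steps and
# then keeps every stage busy: the makespan is 4 + 8 - 1 = 11 chunk-sized
# steps, versus 4 * 8 = 32 if each stage processed the whole sequence in turn.
sched = token_pipeline_schedule(4, 8)
print(max(start for start, _, _ in sched) + 1)   # -> 11
```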
- Optimization-Inspired Learning with Architecture Augmentations and Control Mechanisms for Low-Level Vision [74.9260745577362]
This paper proposes a unified optimization-inspired learning framework to aggregate Generative, Discriminative, and Corrective (GDC) principles.
We construct three propagative modules to effectively solve the optimization models with flexible combinations.
Experiments across varied low-level vision tasks validate the efficacy and adaptability of GDC.
arXiv Detail & Related papers (2020-12-10T03:24:53Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
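A compact sketch of the two memory-saving ingredients being combined, using standard PyTorch utilities: some blocks discard their activations and recompute them during backward, while the rest keep their activations but stage them in host memory ("out of core"). Which blocks go into which set is an arbitrary placeholder here; KARMA's capacity-based planning and prefetching are not shown.

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_memory_savings(recompute_blocks, offload_blocks, x):
    # Blocks in `recompute_blocks` drop their activations and recompute them
    # in the backward pass (redundant recomputation).
    for block in recompute_blocks:
        x = checkpoint(block, x, use_reentrant=False)
    # Blocks in `offload_blocks` keep their activations, but the tensors saved
    # for backward are packed to CPU memory instead of staying on the GPU.
    with torch.autograd.graph.save_on_cpu(pin_memory=True):
        for block in offload_blocks:
            x = block(x)
    return x

# Example split (placeholder): recompute the first half of a block stack and
# offload the activations of the second half.
# blocks = [make_block() for _ in range(12)]
# y = forward_with_memory_savings(blocks[:6], blocks[6:], x)
```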
- Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads [11.646744408920764]
Auto-MAP is a framework for exploring distributed execution plans for DNN workloads.
It automatically discovers fast parallelization strategies through reinforcement learning at the IR level of deep learning models.
Our evaluation shows that Auto-MAP can find the optimal solution in two hours, while achieving better throughput on several NLP and convolution models.
arXiv Detail & Related papers (2020-07-08T12:38:03Z)
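To give a flavor of searching parallelization decisions with reinforcement learning, the toy below learns a per-operator partitioning choice against a made-up cost model. Auto-MAP uses a DQN over the model's IR; this sketch substitutes tabular epsilon-greedy value estimates and a hypothetical estimate_step_time, purely for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ["replicate", "shard_dim0", "shard_dim1"]  # placeholder per-op choices

def estimate_step_time(plan):
    # Hypothetical cost model: sharding is cheaper than replication, and
    # switching layouts between adjacent operators adds a small penalty.
    return sum(1.0 if a == "replicate" else 0.6 for a in plan) \
        + 0.1 * sum(plan[i] != plan[i - 1] for i in range(1, len(plan)))

def search_plan(num_ops, episodes=2000, eps=0.1, lr=0.5):
    Q = defaultdict(float)                     # (op_index, action) -> value estimate
    best_plan, best_time = None, float("inf")
    for _ in range(episodes):
        plan = []
        for op in range(num_ops):              # epsilon-greedy choice per operator
            if random.random() < eps:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(op, a)])
            plan.append(action)
        reward = -estimate_step_time(plan)     # faster plan -> higher reward
        for op, action in enumerate(plan):     # Monte Carlo style value update
            Q[(op, action)] += lr * (reward - Q[(op, action)])
        if -reward < best_time:
            best_plan, best_time = plan, -reward
    return best_plan, best_time

# print(search_plan(num_ops=8))
```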