Amazon SageMaker Model Parallelism: A General and Flexible Framework for
Large Model Training
- URL: http://arxiv.org/abs/2111.05972v1
- Date: Wed, 10 Nov 2021 22:30:21 GMT
- Title: Amazon SageMaker Model Parallelism: A General and Flexible Framework for
Large Model Training
- Authors: Can Karakus, Rahul Huilgol, Fei Wu, Anirudh Subramanian, Cade Daniel,
Derya Cavdar, Teng Xu, Haohan Chen, Arash Rahnama, Luis Quintela
- Abstract summary: We present Amazon SageMaker model parallelism, a software library that integrates with PyTorch.
It enables easy training of large models using model parallelism and other memory-saving features.
We evaluate performance over GPT-3, RoBERTa, BERT, and neural collaborative filtering.
- Score: 10.223511922625065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With deep learning models rapidly growing in size, systems-level solutions
for large-model training are required. We present Amazon SageMaker model
parallelism, a software library that integrates with PyTorch, and enables easy
training of large models using model parallelism and other memory-saving
features. In contrast to existing solutions, the implementation of the
SageMaker library is much more generic and flexible, in that it can
automatically partition and run pipeline parallelism over arbitrary model
architectures with minimal code change, and also offers a general and
extensible framework for tensor parallelism, which supports a wider range of
use cases, and is modular enough to be easily applied to new training scripts.
The library also preserves the native PyTorch user experience to a much larger
degree, supporting module re-use and dynamic graphs, while giving the user full
control over the details of the training step. We evaluate performance over
GPT-3, RoBERTa, BERT, and neural collaborative filtering, and demonstrate
competitive performance over existing solutions.
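The abstract stresses that the library preserves the native PyTorch workflow and requires only minimal code change. As a rough illustration of what such an adapted training script can look like, the sketch below uses the entry points documented for smdistributed.modelparallel.torch (smp.init, smp.DistributedModel, smp.DistributedOptimizer, the @smp.step decorator, and model.backward); these names are not taken from the abstract itself, and exact signatures may vary by library version, so treat this as illustrative rather than authoritative.
```python
# Hypothetical sketch: adapting a plain PyTorch training loop to the
# SageMaker model parallelism library. Names follow the library's documented
# interface but may differ by version.
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()  # initialize the model-parallel runtime (reads the partition config)

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

# Wrapping the model lets the library partition it across devices
# automatically; the rest of the script stays ordinary PyTorch.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-4))
loss_fn = nn.CrossEntropyLoss()

@smp.step  # marks the forward/backward computation to be pipelined over microbatches
def train_step(model, inputs, targets):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    model.backward(loss)  # replaces loss.backward() so the runtime can schedule it
    return loss

def train(loader, epochs=1):
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss_mb = train_step(model, inputs, targets)  # per-microbatch losses
            optimizer.step()
```
The only deviations from a plain PyTorch loop are the two wrappers and the decorated step function, which is what allows the runtime to partition the model and pipeline microbatches without the user specifying the partition by hand.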
Related papers
- FlexModel: A Framework for Interpretability of Distributed Large
Language Models [0.0]
We present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations.
The library is compatible with existing model distribution libraries and encapsulates PyTorch models.
It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals.
arXiv Detail & Related papers (2023-12-05T21:19:33Z)
- CoLLiE: Collaborative Training of Large Language Models in an Efficient Way [59.09824823710863]
CoLLiE is an efficient library that facilitates collaborative training of large language models.
With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization.
arXiv Detail & Related papers (2023-12-01T08:02:16Z)
- Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism [91.9372563527801]
Existing MoE models suffer from tremendous inner-node and inter-node communication overhead.
We propose a novel MoE architecture called Pipeline MoE (PPMoE) to tackle them.
PPMoE builds expert parallelism on top of tensor parallelism and replaces the communication-intensive all-to-all dispatching and gathering.
arXiv Detail & Related papers (2023-04-22T14:09:14Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets; a conceptual sketch of activation compression is given after this list.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [17.798586916628174]
OneFlow is a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model.
SBP enables much easier programming of data parallelism and model parallelism than existing frameworks.
OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks.
arXiv Detail & Related papers (2021-10-28T11:32:14Z)
- Model-Parallel Model Selection for Deep Learning Systems [0.0]
Inefficiencies in machine learning (ML) training prevent most users from making practical use of state-of-the-art models.
Many ML practitioners have turned to model parallelism as a method of distributing the computational requirements across several devices.
We propose a new form of "shard parallelism" combining task and model parallelism, then package it into a framework we name Hydra.
arXiv Detail & Related papers (2021-07-14T03:20:37Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG; a conceptual sketch of activation recomputation follows this list.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- FlexServe: Deployment of PyTorch Models as Flexible REST Endpoints [6.730473762151365]
The integration of artificial intelligence capabilities into modern software systems is increasingly simplified through cloud-based services and representational state transfer (REST) architectures.
Insufficient information about underlying model provenance and a lack of control over model evolution impede wider adoption of these services in operational environments with strict security requirements.
arXiv Detail & Related papers (2020-02-29T18:51:09Z)
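Referenced from the activation-compression entry above: the following is a conceptual, self-contained sketch of one simple compression scheme (top-k sparsification plus fp16 storage) that could be applied to an activation tensor before it crosses a model-parallel boundary. It illustrates the general idea only and is not code or an algorithm from the cited paper.
```python
# Conceptual illustration of activation compression (not from the cited paper).
import math
import torch

def compress_activation(x: torch.Tensor, keep_ratio: float = 0.1):
    """Keep only the largest-magnitude entries of x, stored in fp16."""
    flat = x.flatten()
    k = max(1, int(flat.numel() * keep_ratio))
    _, indices = torch.topk(flat.abs(), k)          # positions of the largest entries
    return flat[indices].half(), indices, x.shape   # (values, positions, original shape)

def decompress_activation(values, indices, shape, dtype=torch.float32):
    """Rebuild a dense tensor from the compressed representation."""
    flat = torch.zeros(math.prod(shape), dtype=dtype, device=values.device)
    flat[indices] = values.to(dtype)
    return flat.reshape(shape)

# Example: compress the output of one stage, reconstruct it on the next.
x = torch.randn(32, 1024)
payload = compress_activation(x, keep_ratio=0.05)
x_hat = decompress_activation(*payload)
```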
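Referenced from the KARMA entry above: the redundant-recomputation half of that strategy can be illustrated with PyTorch's built-in activation checkpointing, which discards intermediate activations during the forward pass and recomputes them during backward. The out-of-core (host-memory offloading) half is not shown, and this sketch does not reproduce KARMA's actual scheduling; the use_reentrant flag assumes a reasonably recent PyTorch release.
```python
# Conceptual illustration of activation recomputation via checkpointing.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """A deep MLP whose per-block activations are recomputed, not stored."""
    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are freed after the forward pass and
            # recomputed during backward, trading extra compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
x = torch.randn(16, 1024, requires_grad=True)
model(x).sum().backward()
```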