HetSeq: Distributed GPU Training on Heterogeneous Infrastructure
- URL: http://arxiv.org/abs/2009.14783v1
- Date: Fri, 25 Sep 2020 19:57:42 GMT
- Title: HetSeq: Distributed GPU Training on Heterogeneous Infrastructure
- Authors: Yifan Ding, Nicholas Botzer and Tim Weninger
- Abstract summary: HetSeq is a software package that provides the capability to train large neural network models on heterogeneous infrastructure.
Experiments with transformer translation and the BERT language model show that HetSeq scales over heterogeneous systems.
- Score: 13.689451154861203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern deep learning systems like PyTorch and Tensorflow are able to train
enormous models with billions (or trillions) of parameters on a distributed
infrastructure. These systems require that the internal nodes have the same
memory capacity and compute performance. Unfortunately, most organizations,
especially universities, have a piecemeal approach to purchasing computer
systems, resulting in a heterogeneous infrastructure, which cannot be used to
compute large models. The present work describes HetSeq, a software package
adapted from the popular PyTorch package that provides the capability to train
large neural network models on heterogeneous infrastructure. Experiments with
transformer translation and the BERT language model show that HetSeq scales over
heterogeneous systems. HetSeq can be easily extended to other tasks such as image
classification. The package and supporting documentation are publicly available at
https://github.com/yifding/hetseq.
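The abstract describes data-parallel training spread across nodes that differ in GPU count and memory. As a rough sketch of that general mechanism (plain torch.distributed rather than HetSeq's own API; the launcher-provided environment variables and the toy model below are assumptions), one process is launched per GPU with a globally unique rank, so nodes contributing different numbers of GPUs can join the same job:

    # Illustrative sketch only, not HetSeq's actual API: data-parallel training
    # with torch.distributed across nodes that may expose different GPU counts.
    # A per-node launcher (e.g. torchrun) is assumed to set RANK, WORLD_SIZE,
    # LOCAL_RANK, MASTER_ADDR and MASTER_PORT for every process.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        rank = int(os.environ["RANK"])              # globally unique across all nodes
        world_size = int(os.environ["WORLD_SIZE"])  # total number of GPU processes in the job
        local_rank = int(os.environ["LOCAL_RANK"])  # GPU index within this node

        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)

        # Stand-in model; HetSeq targets transformer translation and BERT.
        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

        for _ in range(100):
            x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()   # gradients are all-reduced across every GPU on every node
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Because the ranks form a single flat group, a node with two GPUs simply contributes two processes while a node with four contributes four; HetSeq, adapted from PyTorch, presumably layers its heterogeneous-infrastructure support on top of this kind of process-group setup.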
Related papers
- FlexModel: A Framework for Interpretability of Distributed Large
Language Models [0.0]
We present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations.
The library is compatible with existing model distribution libraries and encapsulates PyTorch models.
It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals.
arXiv Detail & Related papers (2023-12-05T21:19:33Z) - TensorBank: Tensor Lakehouse for Foundation Model Training [1.8811254972035676]
Streaming and storing high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language.
We introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries.
This architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis and more.
arXiv Detail & Related papers (2023-09-05T10:00:33Z) - On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL).
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - CLCNet: Rethinking of Ensemble Modeling with Classification Confidence
Network [1.5686134908061993]
CLCNet can determine whether the classification model classifies input samples correctly.
We can utilize CLCNet in a simple cascade structure system consisting of several SOTA (state-of-the-art) classification models.
arXiv Detail & Related papers (2022-05-19T15:07:53Z) - Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$ [118.04625413322827]
$\texttt{t5x}$ and $\texttt{seqio}$ are open-source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
arXiv Detail & Related papers (2022-03-31T17:12:13Z) - LightSeq: Accelerated Training for Transformer-based Models on GPUs [19.02791119065971]
LightSeq is a system for efficient training of Transformer-based models on GPUs.
It supports a variety of network architectures, including BERT (encoder-only), GPT (decoder-only), and Transformer (encoder-decoder).
arXiv Detail & Related papers (2021-10-12T03:17:03Z) - CREPO: An Open Repository to Benchmark Credal Network Algorithms [78.79752265884109]
Credal networks are imprecise probabilistic graphical models based on so-called credal sets of probability mass functions.
A Java library called CREMA has been recently released to model, process and query credal networks.
We present CREPO, an open repository of synthetic credal networks, provided together with the exact results of inference tasks on these models.
arXiv Detail & Related papers (2021-05-10T07:31:59Z) - Diverse Branch Block: Building a Convolution as an Inception-like Unit [123.59890802196797]
We propose a universal building block of Convolutional Neural Network (ConvNet) to improve the performance without any inference-time costs.
The Diverse Branch Block (DBB) enhances the representational capacity of a single convolution by combining diverse branches of different scales and complexities.
After training, a DBB can be equivalently converted into a single conv layer for deployment.
arXiv Detail & Related papers (2021-03-24T18:12:00Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)