HetSeq: Distributed GPU Training on Heterogeneous Infrastructure
- URL: http://arxiv.org/abs/2009.14783v1
- Date: Fri, 25 Sep 2020 19:57:42 GMT
- Title: HetSeq: Distributed GPU Training on Heterogeneous Infrastructure
- Authors: Yifan Ding, Nicholas Botzer and Tim Weninger
- Abstract summary: HetSeq is a software package that provides the capability to train large neural network models on heterogeneous infrastructure.
Experiments with transformer translation and the BERT language model show that HetSeq scales over heterogeneous systems.
- Score: 13.689451154861203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern deep learning systems like PyTorch and Tensorflow are able to train
enormous models with billions (or trillions) of parameters on a distributed
infrastructure. These systems require that the internal nodes have the same
memory capacity and compute performance. Unfortunately, most organizations,
especially universities, have a piecemeal approach to purchasing computer
systems, resulting in a heterogeneous infrastructure, which cannot be used to
compute large models. The present work describes HetSeq, a software package
adapted from the popular PyTorch package that provides the capability to train
large neural network models on heterogeneous infrastructure. Experiments with
transformer translation and the BERT language model show that HetSeq scales over
heterogeneous systems. HetSeq can be easily extended to other tasks such as image
classification. The package and supporting documentation are publicly available at
https://github.com/yifding/hetseq.
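The abstract describes data-parallel training spread across nodes that differ in GPU count and memory. As a rough sketch of that general mechanism (plain torch.distributed rather than HetSeq's own API; the launcher-provided environment variables and the toy model below are assumptions), one process is launched per GPU with a globally unique rank, so nodes contributing different numbers of GPUs can join the same job:

    # Illustrative sketch only, not HetSeq's actual API: data-parallel training
    # with torch.distributed across nodes that may expose different GPU counts.
    # A per-node launcher (e.g. torchrun) is assumed to set RANK, WORLD_SIZE,
    # LOCAL_RANK, MASTER_ADDR and MASTER_PORT for every process.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        rank = int(os.environ["RANK"])              # globally unique across all nodes
        world_size = int(os.environ["WORLD_SIZE"])  # total number of GPU processes in the job
        local_rank = int(os.environ["LOCAL_RANK"])  # GPU index within this node

        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)

        # Stand-in model; HetSeq targets transformer translation and BERT.
        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

        for _ in range(100):
            x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()   # gradients are all-reduced across every GPU on every node
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Because the ranks form a single flat group, a node with two GPUs simply contributes two processes while a node with four contributes four; HetSeq, adapted from PyTorch, presumably layers its heterogeneous-infrastructure support on top of this kind of process-group setup.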
Related papers
- FlexModel: A Framework for Interpretability of Distributed Large
Language Models [0.0]
We present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations.
The library is compatible with existing model distribution libraries and encapsulates PyTorch models.
It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals.
arXiv Detail & Related papers (2023-12-05T21:19:33Z) - TensorBank: Tensor Lakehouse for Foundation Model Training [1.8811254972035676]
Streaming and storing high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language.
We introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries.
This architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis and more.
arXiv Detail & Related papers (2023-09-05T10:00:33Z) - On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL).
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - CLCNet: Rethinking of Ensemble Modeling with Classification Confidence
Network [1.5686134908061993]
CLCNet can determine whether the classification model classifies input samples correctly.
We can utilize CLCNet in a simple cascade structure system consisting of several SOTA (state-of-the-art) classification models.
arXiv Detail & Related papers (2022-05-19T15:07:53Z) - Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$ [118.04625413322827]
$\texttt{t5x}$ and $\texttt{seqio}$ are open-source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
arXiv Detail & Related papers (2022-03-31T17:12:13Z) - LightSeq: Accelerated Training for Transformer-based Models on GPUs [19.02791119065971]
LightSeq is a system for efficient training of Transformer-based models on GPUs.
It supports a variety of network architectures, including BERT (encoder-only), GPT (decoder-only), and Transformer (encoder-decoder).
arXiv Detail & Related papers (2021-10-12T03:17:03Z) - CREPO: An Open Repository to Benchmark Credal Network Algorithms [78.79752265884109]
Credal networks are imprecise probabilistic graphical models based on so-called credal sets of probability mass functions.
A Java library called CREMA has been recently released to model, process and query credal networks.
We present CREPO, an open repository of synthetic credal networks, provided together with the exact results of inference tasks on these models.
arXiv Detail & Related papers (2021-05-10T07:31:59Z) - Diverse Branch Block: Building a Convolution as an Inception-like Unit [123.59890802196797]
We propose a universal building block of Convolutional Neural Network (ConvNet) to improve the performance without any inference-time costs.
The Diverse Branch Block (DBB) enhances the representational capacity of a single convolution by combining diverse branches of different scales and complexities.
After training, a DBB can be equivalently converted into a single conv layer for deployment.
arXiv Detail & Related papers (2021-03-24T18:12:00Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)