LightSeq: Accelerated Training for Transformer-based Models on GPUs
- URL: http://arxiv.org/abs/2110.05722v1
- Date: Tue, 12 Oct 2021 03:17:03 GMT
- Title: LightSeq: Accelerated Training for Transformer-based Models on GPUs
- Authors: Xiaohui Wang, Ying Xiong, Xian Qian, Yang Wei, Lei Li, Mingxuan Wang
- Abstract summary: LightSeq is a system for efficient training of Transformer-based models on GPUs.
It supports a variety of network architectures, including BERT (encoder-only), GPT (decoder-only), and Transformer (encoder-decoder).
- Score: 19.02791119065971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models have proven to be powerful in many natural language,
computer vision, and speech recognition applications. It is expensive to train
these types of models due to variable input lengths, complex computation, and
large numbers of parameters. Existing systems either only focus on efficient
inference or optimize only BERT-like encoder models. In this paper, we present
LightSeq, a system for efficient training of Transformer-based models on GPUs.
We propose a series of GPU optimization techniques tailored to computation flow
and memory access patterns of neural layers in Transformers. LightSeq supports
a variety of network architectures, including BERT (encoder-only), GPT
(decoder-only), and Transformer (encoder-decoder). Our experiments on GPUs with
varying models and datasets show that LightSeq is 1.4-3.5x faster than previous
systems. In particular, it achieves a 308% training speedup compared with existing
systems on a large public machine translation benchmark (WMT14 English-German).
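The abstract describes GPU optimizations tailored to the computation flow and memory-access patterns of Transformer layers but does not include code. As a rough illustration of one such pattern, kernel fusion of the small elementwise operations that follow a matrix multiply, here is a plain-PyTorch sketch; the real system implements this in handwritten CUDA kernels, and the function names and tensor shapes below are illustrative assumptions, not LightSeq's API.

```python
# Illustrative sketch (not LightSeq's actual CUDA code): fusing the
# bias-add, dropout, and residual-add into a single pass over memory.
import torch

def bias_dropout_residual_unfused(x, bias, residual, p, training=True):
    # Three separate elementwise passes over the tensor.
    y = x + bias
    y = torch.nn.functional.dropout(y, p=p, training=training)
    return y + residual

def bias_dropout_residual_fused(x, bias, residual, p, training=True):
    # One pass: the dropout mask is applied together with the bias and
    # residual, so the tensor is read and written only once.
    if training:
        mask = (torch.rand_like(x) > p).to(x.dtype) / (1.0 - p)
    else:
        mask = torch.ones_like(x)
    return (x + bias) * mask + residual

x = torch.randn(8, 128, 512)          # (batch, sequence, hidden) - toy sizes
bias = torch.randn(512)
residual = torch.randn(8, 128, 512)
out = bias_dropout_residual_fused(x, bias, residual, p=0.1)
print(out.shape)                      # torch.Size([8, 128, 512])
```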
Related papers
- 1M parameters are enough? A lightweight CNN-based model for medical
image segmentation [0.4129225533930966]
We look for a lightweight U-Net-based model, namely U-Lite, that matches or exceeds the performance of heavier models.
We design U-Lite around depthwise separable convolutions so that the model both leverages the strengths of CNNs and uses far fewer parameters (see the sketch below).
Overall, U-Lite contains only 878K parameters, 35 times fewer than the traditional U-Net and far fewer still than modern Transformer-based models.
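A minimal sketch of the parameter saving that depthwise separable convolution provides; the channel counts below are illustrative, not U-Lite's actual configuration.

```python
# Why a depthwise separable convolution needs far fewer parameters
# than a standard convolution (toy channel counts).
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    # Depthwise: one k x k filter per input channel (groups=c_in).
    nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in, bias=False),
    # Pointwise: a 1 x 1 convolution that mixes channels.
    nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
)

print(sum(p.numel() for p in standard.parameters()))             # 73728 = 64*128*3*3
print(sum(p.numel() for p in depthwise_separable.parameters()))  # 8768  = 64*3*3 + 64*128
```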
arXiv Detail & Related papers (2023-06-28T11:17:37Z) - Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
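A minimal sketch of the reversible residual mechanism this relies on: a block's inputs can be recomputed from its outputs, so intermediate activations need not be stored. The sub-blocks F and G below are simple placeholders standing in for the attention and MLP sub-layers.

```python
# Reversible residual block: forward computes (y1, y2) from (x1, x2),
# and inverse() recovers (x1, x2) from (y1, y2) instead of caching them.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.G = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = ReversibleBlock(64)
x1, x2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))  # True True
```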
arXiv Detail & Related papers (2023-02-09T18:59:54Z) - Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side of the pre-trained model.
Through gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a 200x speed-up.
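A rough sketch of the general recipe the summary describes: query the pre-trained model once per sample, cache its outputs, and train only a small decoder on top. The toy frozen backbone and classifier head below are assumptions for illustration, not DecT's actual architecture.

```python
# Output-side tuning sketch: the backbone is frozen and queried once,
# then a lightweight decoder is trained on the cached features.
import torch
import torch.nn as nn

frozen_backbone = nn.Sequential(nn.Embedding(1000, 128), nn.Flatten(1), nn.Linear(128 * 16, 256))
for p in frozen_backbone.parameters():
    p.requires_grad = False

decoder = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

tokens = torch.randint(0, 1000, (32, 16))     # toy batch of token ids
labels = torch.randint(0, 2, (32,))

with torch.no_grad():                         # one backbone query per sample, cached
    features = frozen_backbone(tokens)

for _ in range(100):                          # decoder-only training: seconds, not hours
    loss = nn.functional.cross_entropy(decoder(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```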
arXiv Detail & Related papers (2022-12-16T11:15:39Z) - Communication-Efficient TeraByte-Scale Model Training Framework for
Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z) - Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
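A minimal sketch of the described setup, with illustrative modules and sizes rather than the paper's exact configuration: a frozen pretrained encoder, a trainable bottleneck that pools the sequence into a single sentence vector, and a single decoder layer that reconstructs the input (the paper trains this with a denoising objective).

```python
# Sentence-bottleneck autoencoder sketch: only the bottleneck, the
# decoder layer, and the LM head would be trained.
import torch
import torch.nn as nn

dim, vocab = 256, 1000

frozen_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
)
for p in frozen_encoder.parameters():
    p.requires_grad = False

embed = nn.Embedding(vocab, dim)
bottleneck = nn.Linear(dim, dim)       # pools the encoded sequence into one vector
decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
lm_head = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (8, 20))
hidden = frozen_encoder(embed(tokens))                        # (8, 20, dim), frozen
sentence = bottleneck(hidden.mean(dim=1))                     # (8, dim) sentence vector
recon = decoder_layer(embed(tokens), sentence.unsqueeze(1))   # decoder attends to one vector
logits = lm_head(recon)                                       # (8, 20, vocab) reconstruction
```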
arXiv Detail & Related papers (2021-08-31T19:39:55Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
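A worked NumPy sketch of the underlying fact: multiplying by an n x n Toeplitz matrix (the structure induced by relative positions i - j) costs O(n log n) when the matrix is embedded in a length-2n circulant and the product is computed with the FFT. Sizes are illustrative.

```python
import numpy as np

n = 8
col = np.random.randn(n)   # t_0, t_1, ..., t_{n-1}: entries on and below the diagonal
row = np.random.randn(n)   # t_0, t_{-1}, ..., t_{-(n-1)}: entries above the diagonal
row[0] = col[0]

# Dense Toeplitz matrix T[i, j] = t_{i-j}.
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
v = np.random.randn(n)

# Embed T's generating sequence in a length-2n circulant and multiply via FFT.
circ = np.concatenate([col, [0.0], row[1:][::-1]])
fast = np.real(np.fft.ifft(np.fft.fft(circ) * np.fft.fft(np.concatenate([v, np.zeros(n)]))))[:n]

print(np.allclose(T @ v, fast))  # True: same product in O(n log n)
```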
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
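A sketch of one way to extract spatio-temporal tokens from a video clip with a 3D convolution over space-time patches; the sizes and the single encoder layer below are illustrative assumptions, not the paper's exact configuration.

```python
# Tokenise a video into spatio-temporal patches and run one
# transformer encoder layer over the resulting token sequence.
import torch
import torch.nn as nn

batch, frames, height, width = 2, 16, 224, 224
video = torch.randn(batch, 3, frames, height, width)

# Each token covers a 2 x 16 x 16 spatio-temporal patch.
to_tokens = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))
tokens = to_tokens(video)                   # (2, 768, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (2, 1568, 768) token sequence

encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoded = encoder_layer(tokens)             # would be fed to a stack of such layers
print(encoded.shape)                        # torch.Size([2, 1568, 768])
```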
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - LightSeq: A High Performance Inference Library for Transformers [39.13192008249629]
LightSeq is a highly efficient inference library for Transformer models.
LightSeq includes a series of optimization techniques to streamline the neural layers and to reduce memory footprint.
arXiv Detail & Related papers (2020-10-23T13:45:26Z) - GShard: Scaling Giant Models with Conditional Computation and Automatic
Sharding [46.74457030177477]
We show how to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts layers beyond 600 billion parameters using automatic sharding.
We demonstrate that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English.
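A toy sketch of the Sparsely-Gated Mixture-of-Experts routing idea: a gating network sends each token to its top-2 experts, so only a fraction of the parameters is active per token. The cross-accelerator sharding that GShard adds is omitted here, and all sizes are illustrative.

```python
# Top-2 expert routing: per-token gate scores select which expert
# feed-forward networks process each token.
import torch
import torch.nn as nn

num_experts, dim, n_tokens = 4, 64, 32
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
    for _ in range(num_experts)
])
gate = nn.Linear(dim, num_experts)

x = torch.randn(n_tokens, dim)
scores = torch.softmax(gate(x), dim=-1)           # (tokens, num_experts)
top_w, top_idx = scores.topk(2, dim=-1)           # top-2 experts per token
top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalise the two gate weights

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    for slot in range(2):
        mask = top_idx[:, slot] == e              # tokens routed to expert e in this slot
        if mask.any():
            w = top_w[mask, slot].unsqueeze(1)    # (k, 1) gate weights
            out[mask] += w * expert(x[mask])
print(out.shape)                                  # torch.Size([32, 64])
```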
arXiv Detail & Related papers (2020-06-30T10:42:02Z) - Efficient Wait-k Models for Simultaneous Machine Translation [46.01342928010307]
Simultaneous machine translation consists in starting output generation before the entire input sequence is available.
Wait-k decoders offer a simple but efficient approach for this problem.
We investigate the behavior of wait-k decoding in low resource settings for spoken corpora using IWSLT datasets.
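A minimal sketch of the wait-k policy itself: read k source tokens, then alternate between writing one target token and reading one more source token. The `translate_step` function is a hypothetical placeholder standing in for a real incremental translation model.

```python
def translate_step(source_prefix, target_prefix):
    # Placeholder model: echo the latest available source token.
    return f"tgt({source_prefix[-1]})"

def wait_k_decode(source_stream, k):
    source, target = [], []
    for token in source_stream:
        source.append(token)                                # READ one source token
        if len(source) >= k:
            target.append(translate_step(source, target))   # WRITE one target token
    # Source exhausted: finish writing with full source context.
    while len(target) < len(source):
        target.append(translate_step(source, target))
    return target

print(wait_k_decode(["ich", "sehe", "eine", "katze"], k=2))
# ['tgt(sehe)', 'tgt(eine)', 'tgt(katze)', 'tgt(katze)']
```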
arXiv Detail & Related papers (2020-05-18T11:14:23Z)