ByteTransformer: A High-Performance Transformer Boosted for
Variable-Length Inputs
- URL: http://arxiv.org/abs/2210.03052v1
- Date: Thu, 6 Oct 2022 16:57:23 GMT
- Title: ByteTransformer: A High-Performance Transformer Boosted for
Variable-Length Inputs
- Authors: Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang,
Zizhong Chen, Xin Liu, Yibo Zhu
- Abstract summary: We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses state-of-the-art Transformer frameworks such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
- Score: 6.9136984255301
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Transformer has been the cornerstone model of Natural Language Processing (NLP)
over the past decade. Despite its great success in Deep Learning (DL) applications, the
ever-growing parameter space of transformer models increases the demand for accelerating
their performance. In addition, NLP problems commonly involve variable-length sequences,
since the number of words varies from sentence to sentence. Existing DL frameworks pad
variable-length sequences to the maximal length, which leads to significant memory and
computational overhead. In this paper, we present ByteTransformer, a high-performance
transformer boosted for variable-length inputs. We propose a zero-padding algorithm that
frees the whole transformer from redundant computation on useless padded tokens. Beyond
this algorithm-level optimization, we provide architecture-aware optimizations for the
transformer's functional modules, especially the performance-critical multi-head
attention (MHA). Experimental results on an NVIDIA A100 GPU with variable-length sequence
inputs show that our fused MHA (FMHA) outperforms the standard PyTorch MHA by 6.13X. The
end-to-end performance of ByteTransformer for a standard BERT transformer model surpasses
state-of-the-art Transformer frameworks such as PyTorch JIT, TensorFlow XLA, Tencent
TurboTransformer and NVIDIA FasterTransformer by 87%, 131%, 138% and 46%, respectively.
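To make the zero-padding idea concrete, here is a minimal PyTorch sketch (an illustration of the general packing/unpacking pattern, not ByteTransformer's fused CUDA kernels; the helper names are made up): valid tokens are gathered into a dense packed tensor before token-wise modules and scattered back afterwards, so no work is spent on padded positions.

    import torch

    def remove_padding(x, mask):
        # x: (batch, seq_len, hidden); mask: (batch, seq_len), 1 for real tokens.
        # Gather only the valid tokens into a packed (num_valid, hidden) tensor.
        idx = mask.flatten().nonzero(as_tuple=True)[0]      # positions of real tokens
        packed = x.reshape(-1, x.size(-1)).index_select(0, idx)
        return packed, idx

    def restore_padding(packed, idx, batch, seq_len, hidden):
        # Scatter packed tokens back to the padded layout (zeros elsewhere).
        out = packed.new_zeros(batch * seq_len, hidden)
        out.index_copy_(0, idx, packed)
        return out.view(batch, seq_len, hidden)

    # Token-wise layers (e.g. the feed-forward blocks) can then run on the packed
    # tensor only, skipping all padded positions:
    batch, seq_len, hidden = 2, 8, 16
    x = torch.randn(batch, seq_len, hidden)
    mask = torch.zeros(batch, seq_len, dtype=torch.long)
    mask[0, :5] = 1   # sequence lengths 5 and 3
    mask[1, :3] = 1
    packed, idx = remove_padding(x, mask)
    ffn = torch.nn.Linear(hidden, hidden)
    y = restore_padding(ffn(packed), idx, batch, seq_len, hidden)

In ByteTransformer itself the padding-free path covers the whole transformer, including the fused multi-head attention kernel; the snippet only shows the packing pattern for a single token-wise layer.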
Related papers
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
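As a rough, back-of-the-envelope illustration of the layer-sharing point (a generic PyTorch count, not the MoEUT architecture; the dimensions are arbitrary), applying one block L times keeps only 1/L of the block parameters of an unshared L-layer stack:

    import torch.nn as nn

    d_model, n_layers = 512, 12
    block = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    per_block = sum(p.numel() for p in block.parameters())

    shared_params = per_block                # one block reused n_layers times
    unshared_params = per_block * n_layers   # n_layers independent blocks
    print(per_block, shared_params, unshared_params)  # sharing cuts block parameters ~12x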
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- On the Expressive Power of a Variant of the Looped Transformer [83.30272757948829]
We design a novel transformer block, dubbed AlgoFormer, to empower transformers with algorithmic capabilities.
The proposed AlgoFormer can achieve significantly higher expressiveness in algorithm representation when using the same number of parameters.
Some theoretical and empirical results are presented to show that the designed transformer has the potential to be smarter than human-designed algorithms.
arXiv Detail & Related papers (2024-02-21T07:07:54Z)
- Enhanced Transformer Architecture for Natural Language Processing [2.6071653283020915]
The Transformer is a state-of-the-art model in the field of natural language processing (NLP).
In this paper, a novel Transformer structure is proposed. It features full layer normalization, weighted residual connections, positional encoding exploiting reinforcement learning, and zero masked self-attention.
The proposed Transformer model, which is called Enhanced Transformer, is validated by the bilingual evaluation understudy (BLEU) score obtained with the Multi30k translation dataset.
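One plausible reading of the "weighted residual connection" item, sketched below purely for illustration (the gating form, parameter names and placement of LayerNorm are assumptions, not the paper's definition):

    import torch
    import torch.nn as nn

    class WeightedResidual(nn.Module):
        # y = LayerNorm(alpha * x + beta * sublayer(x)) with learnable scalars.
        def __init__(self, sublayer, d_model):
            super().__init__()
            self.sublayer = sublayer
            self.alpha = nn.Parameter(torch.ones(1))
            self.beta = nn.Parameter(torch.ones(1))
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(self.alpha * x + self.beta * self.sublayer(x))

    block = WeightedResidual(nn.Linear(32, 32), d_model=32)
    print(block(torch.randn(4, 32)).shape)  # (4, 32)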
arXiv Detail & Related papers (2023-10-17T01:59:07Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train or even fine-tune, and are so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU [19.103130032967663]
Incremental processing allows interactive systems to respond based on partial inputs.
Recent work attempts to apply Transformers incrementally via restart-incrementality.
This approach is computationally costly and does not scale efficiently for long sequences.
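The cost issue can be made concrete with a small sketch (a generic illustration using a stock PyTorch encoder as a stand-in; the model and inputs are placeholders): restart-incrementality re-encodes the whole prefix after every new token, so the total work grows much faster than a single pass over the final sequence.

    import torch
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=2)   # stand-in bidirectional encoder

    tokens = torch.randn(1, 10, 64)   # a 10-token utterance arriving one token at a time
    outputs = []
    for t in range(1, tokens.size(1) + 1):
        prefix = tokens[:, :t, :]                 # restart: re-encode the entire prefix
        outputs.append(model(prefix)[:, -1, :])   # output for the newest token
    # Each restart recomputes attention over the whole prefix, which is the
    # scalability issue analyzed in the paper.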
arXiv Detail & Related papers (2021-09-15T15:20:29Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
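The Toeplitz-plus-FFT observation can be checked with a small generic NumPy sketch (illustrative only, not the paper's kernelized-attention code): a Toeplitz matrix-vector product can be evaluated in O(n log n) by embedding the matrix in a circulant one and multiplying in the Fourier domain.

    import numpy as np

    def toeplitz_matvec_fft(c, r, x):
        # Multiply the n x n Toeplitz matrix with first column c and first row r
        # (r[0] == c[0]) by x, via circulant embedding + FFT, in O(n log n).
        n = len(c)
        m = 2 * n                                    # any length >= 2n - 1 works
        v = np.concatenate([c, np.zeros(m - (2 * n - 1)), r[1:][::-1]])
        y = np.concatenate([x, np.zeros(m - n)])
        return np.fft.ifft(np.fft.fft(v) * np.fft.fft(y)).real[:n]

    # Check against an explicit dense Toeplitz matrix.
    n = 6
    c, r, x = np.random.randn(n), np.random.randn(n), np.random.randn(n)
    r[0] = c[0]
    T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
    assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))

In the paper's setting, the matrix built from the relative positional encoding has exactly this Toeplitz structure, which is what makes the FFT shortcut applicable.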
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
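As a generic illustration of what a recurrent counterpart of softmax attention looks like (a standard kernelized linear-attention recurrence with an assumed elu+1 feature map; not the paper's exact conversion recipe):

    import torch
    import torch.nn.functional as F

    def phi(x):
        return F.elu(x) + 1   # a common positive feature map (an assumption here)

    def recurrent_attention(q, k, v):
        # q, k, v: (seq_len, d). Keeps an O(d*d) state instead of attending over
        # the whole history, so each step's cost is independent of sequence length.
        d = q.size(-1)
        S = torch.zeros(d, d)      # running sum of phi(k_t) v_t^T
        z = torch.zeros(d)         # running sum of phi(k_t)
        outs = []
        for t in range(q.size(0)):
            qt, kt, vt = phi(q[t]), phi(k[t]), v[t]
            S = S + torch.outer(kt, vt)
            z = z + kt
            outs.append((qt @ S) / (qt @ z + 1e-6))
        return torch.stack(outs)

    q, k, v = (torch.randn(5, 8) for _ in range(3))
    print(recurrent_attention(q, k, v).shape)  # torch.Size([5, 8])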
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- TurboTransformers: An Efficient GPU Serving System For Transformer Models [17.4637724940437]
The TurboTransformers system consists of a computing runtime and a serving framework.
An efficient parallel algorithm is proposed for GPU-based batch reduction operations.
A memory allocation algorithm is designed for variable-length input situations.
A serving framework equipped with a new batch scheduler achieves the optimal throughput on variable-length requests.
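A toy version of length-aware batching (purely illustrative; TurboTransformers' actual scheduler and memory allocator are more sophisticated) groups pending requests of similar length so that less computation is wasted on padding:

    def schedule_batches(request_lengths, max_batch_tokens=1024):
        # Greedy sketch: sort requests by length, then pack consecutive requests
        # into a batch while (batch_size * max_len_in_batch) stays under budget.
        order = sorted(range(len(request_lengths)), key=lambda i: request_lengths[i])
        batches, current = [], []
        for i in order:
            candidate = current + [i]
            padded_tokens = len(candidate) * max(request_lengths[j] for j in candidate)
            if padded_tokens > max_batch_tokens and current:
                batches.append(current)
                candidate = [i]
            current = candidate
        if current:
            batches.append(current)
        return batches

    print(schedule_batches([12, 200, 35, 190, 18, 33]))

Sorting by length keeps the per-batch maximum close to the individual lengths, which is the basic reason length-aware scheduling reduces padding waste on variable-length requests.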
arXiv Detail & Related papers (2020-10-09T07:28:38Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
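The segment-aware idea amounts to telling the model where each token sits at several granularities; a minimal sketch (the embedding-sum form and index names are illustrative assumptions, not Segatron's exact implementation) combines token, sentence and paragraph position embeddings:

    import torch
    import torch.nn as nn

    class SegmentAwarePositionEmbedding(nn.Module):
        def __init__(self, d_model, max_tok=512, max_sent=64, max_para=16):
            super().__init__()
            self.tok = nn.Embedding(max_tok, d_model)    # position within the sequence
            self.sent = nn.Embedding(max_sent, d_model)  # index of the sentence
            self.para = nn.Embedding(max_para, d_model)  # index of the paragraph

        def forward(self, tok_pos, sent_pos, para_pos):
            # Each argument: (batch, seq_len) integer indices.
            return self.tok(tok_pos) + self.sent(sent_pos) + self.para(para_pos)

    emb = SegmentAwarePositionEmbedding(d_model=32)
    tok = torch.arange(6).unsqueeze(0)
    sent = torch.tensor([[0, 0, 0, 1, 1, 1]])
    para = torch.zeros(1, 6, dtype=torch.long)
    print(emb(tok, sent, para).shape)  # (1, 6, 32)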
arXiv Detail & Related papers (2020-04-30T17:38:27Z)