Lite Transformer with Long-Short Range Attention
- URL: http://arxiv.org/abs/2004.11886v1
- Date: Fri, 24 Apr 2020 17:52:25 GMT
- Title: Lite Transformer with Long-Short Range Attention
- Authors: Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han
- Abstract summary: We present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices.
Lite Transformer outperforms the transformer on WMT'14 English-French by 1.2/1.7 BLEU under constrained resources.
Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years.
- Score: 31.946796118788285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has become ubiquitous in natural language processing (e.g.,
machine translation, question answering); however, it requires an enormous amount
of computation to achieve high performance, which makes it unsuitable for
mobile applications that are tightly constrained by hardware resources and
battery. In this paper, we present an efficient mobile NLP architecture, Lite
Transformer, to facilitate deploying mobile NLP applications on edge devices.
The key primitive is the Long-Short Range Attention (LSRA), where one group of
heads specializes in the local context modeling (by convolution) while another
group specializes in the long-distance relationship modeling (by attention).
Such specialization brings consistent improvement over the vanilla transformer
on three well-established language tasks: machine translation, abstractive
summarization, and language modeling. Under constrained resources (500M/100M
MACs), Lite Transformer outperforms the transformer on WMT'14 English-French by
1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of the
transformer base model by 2.5x with only 0.3 BLEU score degradation. Combined with
pruning and quantization, we further compress the model size of Lite
Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8
lower perplexity than the transformer at around 500M MACs. Notably, Lite
Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU
for the mobile NLP setting without the costly architecture search that requires
more than 250 GPU years. Code has been made available at
https://github.com/mit-han-lab/lite-transformer.
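To make the LSRA primitive above concrete, here is a minimal PyTorch sketch of the idea: the feature dimension is split into a convolutional branch for local context and a multi-head attention branch for long-range context. Module and parameter names (LSRASketch, conv_kernel, etc.) are illustrative assumptions based only on the abstract; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of Long-Short Range Attention (LSRA), based only on the
# abstract above: half of the channels go through a convolution (local
# context), the other half through multi-head attention (long-range context).
# Names and details are illustrative, not the official code.
import torch
import torch.nn as nn


class LSRASketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int = 4, conv_kernel: int = 3):
        super().__init__()
        assert embed_dim % 2 == 0, "channels are split evenly across the two branches"
        half = embed_dim // 2
        # Local branch: depthwise convolution over the sequence dimension.
        self.local_conv = nn.Conv1d(half, half, kernel_size=conv_kernel,
                                    padding=conv_kernel // 2, groups=half)
        # Global branch: standard multi-head self-attention.
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        local_in, global_in = x.chunk(2, dim=-1)
        # Conv1d expects (batch, channels, seq_len).
        local_out = self.local_conv(local_in.transpose(1, 2)).transpose(1, 2)
        global_out, _ = self.attn(global_in, global_in, global_in)
        return self.out_proj(torch.cat([local_out, global_out], dim=-1))


# Usage: a batch of 2 sequences, length 10, model width 128.
x = torch.randn(2, 10, 128)
print(LSRASketch(embed_dim=128)(x).shape)  # torch.Size([2, 10, 128])
```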
Related papers
- Enhanced Transformer Architecture for Natural Language Processing [2.6071653283020915]
Transformer is a state-of-the-art model in the field of natural language processing (NLP).
In this paper, a novel Transformer structure is proposed. It features full layer normalization, weighted residual connections, positional encoding exploiting reinforcement learning, and zero masked self-attention.
The proposed Transformer model, which is called Enhanced Transformer, is validated by the bilingual evaluation understudy (BLEU) score obtained with the Multi30k translation dataset.
arXiv Detail & Related papers (2023-10-17T01:59:07Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs [6.9136984255301]
We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
arXiv Detail & Related papers (2022-10-06T16:57:23Z)
- Vis-TOP: Visual Transformer Overlay Processor [9.80151619872144]
Transformer has achieved good results in Natural Language Processing (NLP) and has also started to expand into Computer Vision (CV).
We propose Vis-TOP, an overlay processor for various visual Transformer models.
Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure.
arXiv Detail & Related papers (2021-10-21T08:11:12Z)
- Dynamic Transformer for Efficient Machine Translation on Embedded Devices [0.9786690381850356]
We propose a machine translation model that scales the Transformer architecture based on the available resources at any particular time.
The proposed approach, 'Dynamic-HAT', uses a HAT SuperTransformer as the backbone to search for SubTransformers with different accuracy-latency trade-offs.
The Dynamic-HAT is tested on the Jetson Nano and the approach uses inherited SubTransformers sampled directly from the SuperTransformer with a switching time of 1s.
arXiv Detail & Related papers (2021-07-17T07:36:29Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations (a rough sketch of this long-short idea appears after this list).
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translation with 8-15 times speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
- DeLighT: Deep and Light-weight Transformer [116.9850555964728]
We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters.
DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks using block-wise scaling.
arXiv Detail & Related papers (2020-08-03T03:08:29Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
- Transformer on a Diet [81.09119185568296]
Transformer has been widely used thanks to its ability to capture sequence information in an efficient way.
Recent developments, such as BERT and GPT-2, deliver only heavy architectures with a focus on effectiveness.
We explore three carefully designed light Transformer architectures to figure out whether a Transformer with less computation can produce competitive results.
arXiv Detail & Related papers (2020-02-14T18:41:58Z)
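As a companion to the LSRA sketch above, here is a rough illustration of the long-short idea from the Long-Short Transformer (Transformer-LS) entry in the list: a short-term branch attends within a local window, while a long-range branch attends to keys and values compressed to a fixed length r by a data-dependent ("dynamic") projection. All names and details below are assumptions for illustration, not that paper's actual implementation.

```python
# Rough sketch of a long-short attention layer in the spirit of the
# Transformer-LS summary above. Single-head for brevity; illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongShortAttentionSketch(nn.Module):
    def __init__(self, dim: int, window: int = 4, r: int = 8):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        # Produces a (seq_len x r) projection from the content itself.
        self.dyn_proj = nn.Linear(dim, r)
        self.window = window
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, n, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Long-range branch: compress keys/values to length r via a
        # content-dependent projection over the sequence positions.
        p = F.softmax(self.dyn_proj(x), dim=1)           # (b, n, r)
        k_long = torch.einsum("bnr,bnd->brd", p, k)      # (b, r, d)
        v_long = torch.einsum("bnr,bnd->brd", p, v)      # (b, r, d)

        # Short-term branch: only keys within a local window are visible.
        idx = torch.arange(n, device=x.device)
        local = (idx[None, :] - idx[:, None]).abs() <= self.window   # (n, n)

        # Attend jointly over local keys and the r compressed keys.
        scores_short = (q @ k.transpose(1, 2)) * self.scale          # (b, n, n)
        scores_short = scores_short.masked_fill(~local, float("-inf"))
        scores_long = (q @ k_long.transpose(1, 2)) * self.scale      # (b, n, r)
        attn = F.softmax(torch.cat([scores_short, scores_long], dim=-1), dim=-1)
        return attn[..., :n] @ v + attn[..., n:] @ v_long


# Usage: a batch of 2 sequences, length 16, model width 64.
x = torch.randn(2, 16, 64)
print(LongShortAttentionSketch(64)(x).shape)  # torch.Size([2, 16, 64])
```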
This list is automatically generated from the titles and abstracts of the papers in this site.