Lite Transformer with Long-Short Range Attention
- URL: http://arxiv.org/abs/2004.11886v1
- Date: Fri, 24 Apr 2020 17:52:25 GMT
- Title: Lite Transformer with Long-Short Range Attention
- Authors: Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han
- Abstract summary: We present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices.
Lite Transformer outperforms the transformer on WMT'14 English-French by 1.2/1.7 BLEU under constrained resources.
Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years.
- Score: 31.946796118788285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has become ubiquitous in natural language processing (e.g.,
machine translation, question answering); however, it requires an enormous amount
of computation to achieve high performance, which makes it unsuitable for
mobile applications that are tightly constrained by hardware resources and
battery. In this paper, we present an efficient mobile NLP architecture, Lite
Transformer, to facilitate deploying mobile NLP applications on edge devices.
The key primitive is the Long-Short Range Attention (LSRA), where one group of
heads specializes in the local context modeling (by convolution) while another
group specializes in the long-distance relationship modeling (by attention).
Such specialization brings consistent improvement over the vanilla transformer
on three well-established language tasks: machine translation, abstractive
summarization, and language modeling. Under constrained resources (500M/100M
MACs), Lite Transformer outperforms the transformer on WMT'14 English-French by
1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of the
transformer base model by 2.5x with only 0.3 BLEU score degradation. Combined with
pruning and quantization, we further compress the model size of Lite
Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8
lower perplexity than the transformer at around 500M MACs. Notably, Lite
Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU
for the mobile NLP setting without the costly architecture search that requires
more than 250 GPU years. Code has been made available at
https://github.com/mit-han-lab/lite-transformer.
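To make the LSRA primitive above concrete, here is a minimal PyTorch sketch of the idea: the feature dimension is split into a convolutional branch for local context and a multi-head attention branch for long-range context. Module and parameter names (LSRASketch, conv_kernel, etc.) are illustrative assumptions based only on the abstract; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of Long-Short Range Attention (LSRA), based only on the
# abstract above: half of the channels go through a convolution (local
# context), the other half through multi-head attention (long-range context).
# Names and details are illustrative, not the official code.
import torch
import torch.nn as nn


class LSRASketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int = 4, conv_kernel: int = 3):
        super().__init__()
        assert embed_dim % 2 == 0, "channels are split evenly across the two branches"
        half = embed_dim // 2
        # Local branch: depthwise convolution over the sequence dimension.
        self.local_conv = nn.Conv1d(half, half, kernel_size=conv_kernel,
                                    padding=conv_kernel // 2, groups=half)
        # Global branch: standard multi-head self-attention.
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        local_in, global_in = x.chunk(2, dim=-1)
        # Conv1d expects (batch, channels, seq_len).
        local_out = self.local_conv(local_in.transpose(1, 2)).transpose(1, 2)
        global_out, _ = self.attn(global_in, global_in, global_in)
        return self.out_proj(torch.cat([local_out, global_out], dim=-1))


# Usage: a batch of 2 sequences, length 10, model width 128.
x = torch.randn(2, 10, 128)
print(LSRASketch(embed_dim=128)(x).shape)  # torch.Size([2, 10, 128])
```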
Related papers
- Enhanced Transformer Architecture for Natural Language Processing [2.6071653283020915]
Transformer is a state-of-the-art model in the field of natural language processing (NLP).
In this paper, a novel Transformer structure is proposed. It features full layer normalization, weighted residual connections, positional encoding exploiting reinforcement learning, and zero masked self-attention.
The proposed Transformer model, which is called Enhanced Transformer, is validated by the bilingual evaluation understudy (BLEU) score obtained with the Multi30k translation dataset.
arXiv Detail & Related papers (2023-10-17T01:59:07Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs [6.9136984255301]
We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
arXiv Detail & Related papers (2022-10-06T16:57:23Z)
- Vis-TOP: Visual Transformer Overlay Processor [9.80151619872144]
Transformer has achieved good results in Natural Language Processing (NLP) and has also started to expand into Computer Vision (CV).
We propose Vis-TOP, an overlay processor for various visual Transformer models.
Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure.
arXiv Detail & Related papers (2021-10-21T08:11:12Z)
- Dynamic Transformer for Efficient Machine Translation on Embedded Devices [0.9786690381850356]
We propose a machine translation model that scales the Transformer architecture based on the available resources at any particular time.
The proposed approach, 'Dynamic-HAT', uses a HAT SuperTransformer as the backbone to search for SubTransformers with different accuracy-latency trade-offs.
The Dynamic-HAT is tested on the Jetson Nano and the approach uses inherited SubTransformers sampled directly from the SuperTransformer with a switching time of 1s.
arXiv Detail & Related papers (2021-07-17T07:36:29Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations (a rough sketch of this long-short idea appears after this list).
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translation with 8-15 times speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
- DeLighT: Deep and Light-weight Transformer [116.9850555964728]
We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters.
DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks using block-wise scaling.
arXiv Detail & Related papers (2020-08-03T03:08:29Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
- Transformer on a Diet [81.09119185568296]
Transformer has been widely used thanks to its ability to capture sequence information in an efficient way.
Recent developments, such as BERT and GPT-2, deliver only heavy architectures with a focus on effectiveness.
We explore three carefully designed light Transformer architectures to figure out whether a Transformer with less computation can produce competitive results.
arXiv Detail & Related papers (2020-02-14T18:41:58Z)
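As a companion to the LSRA sketch above, here is a rough illustration of the long-short idea from the Long-Short Transformer (Transformer-LS) entry in the list: a short-term branch attends within a local window, while a long-range branch attends to keys and values compressed to a fixed length r by a data-dependent ("dynamic") projection. All names and details below are assumptions for illustration, not that paper's actual implementation.

```python
# Rough sketch of a long-short attention layer in the spirit of the
# Transformer-LS summary above. Single-head for brevity; illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongShortAttentionSketch(nn.Module):
    def __init__(self, dim: int, window: int = 4, r: int = 8):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        # Produces a (seq_len x r) projection from the content itself.
        self.dyn_proj = nn.Linear(dim, r)
        self.window = window
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, n, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Long-range branch: compress keys/values to length r via a
        # content-dependent projection over the sequence positions.
        p = F.softmax(self.dyn_proj(x), dim=1)           # (b, n, r)
        k_long = torch.einsum("bnr,bnd->brd", p, k)      # (b, r, d)
        v_long = torch.einsum("bnr,bnd->brd", p, v)      # (b, r, d)

        # Short-term branch: only keys within a local window are visible.
        idx = torch.arange(n, device=x.device)
        local = (idx[None, :] - idx[:, None]).abs() <= self.window   # (n, n)

        # Attend jointly over local keys and the r compressed keys.
        scores_short = (q @ k.transpose(1, 2)) * self.scale          # (b, n, n)
        scores_short = scores_short.masked_fill(~local, float("-inf"))
        scores_long = (q @ k_long.transpose(1, 2)) * self.scale      # (b, n, r)
        attn = F.softmax(torch.cat([scores_short, scores_long], dim=-1), dim=-1)
        return attn[..., :n] @ v + attn[..., n:] @ v_long


# Usage: a batch of 2 sequences, length 16, model width 64.
x = torch.randn(2, 16, 64)
print(LongShortAttentionSketch(64)(x).shape)  # torch.Size([2, 16, 64])
```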
This list is automatically generated from the titles and abstracts of the papers in this site.