Easy and Efficient Transformer: Scalable Inference Solution For large NLP model
- URL: http://arxiv.org/abs/2104.12470v1
- Date: Mon, 26 Apr 2021 11:00:56 GMT
- Title: Easy and Efficient Transformer: Scalable Inference Solution For large NLP model
- Authors: Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao
- Abstract summary: This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) -- is proposed.
EET achieves a 1.5-15x speedup over the state of the art, varying with context length.
- Score: 14.321889138798072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ultra-large-scale pre-training model can effectively improve the
performance of a variety of tasks, but it also brings a heavy computational burden
to inference. This paper introduces a series of ultra-large-scale pre-training
model optimization methods that combine algorithm characteristics with GPU
hardware characteristics, and on this basis proposes an inference engine --
Easy and Efficient Transformer (EET) -- which delivers a significant performance
improvement over existing schemes.
We first introduce a pre-padding decoding mechanism that improves token
parallelism for generation tasks. Then we design highly optimized kernels that
remove sequence masks, achieve cost-free calculation for padding tokens, and
support long sequences and large embedding sizes. Thirdly, we introduce a
user-friendly inference system with an easy service pipeline, which greatly
reduces the difficulty of engineering deployment while sustaining high
throughput. Compared to Faster Transformer's implementation of GPT-2 on A100,
EET achieves a 1.5-15x speedup over the state of the art, varying with context
length. EET is available at https://github.com/NetEase-FuXi/EET.
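To make the pre-padding idea concrete, here is a minimal, framework-free sketch of one natural reading of it (not EET's actual CUDA implementation; `PAD` and `step_fn` are hypothetical stand-ins): prompts are left-padded to a common length, so every sequence's last real token, and every newly generated token, sits at the same aligned position across the batch.

```python
PAD = 0  # hypothetical pad token id

def pre_pad(batch):
    """Left-pad token-id lists to a common length."""
    max_len = max(len(seq) for seq in batch)
    return [[PAD] * (max_len - len(seq)) + seq for seq in batch]

def greedy_decode(batch, step_fn, num_steps):
    """Append num_steps tokens to every pre-padded sequence.

    step_fn is a stand-in for one forward pass of the model: it takes
    the current batch and returns one next-token id per sequence.
    """
    batch = pre_pad(batch)
    for _ in range(num_steps):
        next_tokens = step_fn(batch)
        # Every sequence grows at the same aligned position, so the new
        # tokens form one dense batch with no per-token sequence masks.
        batch = [seq + [tok] for seq, tok in zip(batch, next_tokens)]
    return batch

# Example with a dummy model that always emits token 7:
out = greedy_decode([[5, 3], [9, 8, 2, 4]], lambda b: [7] * len(b), 3)
# out == [[0, 0, 5, 3, 7, 7, 7], [9, 8, 2, 4, 7, 7, 7]]
```

Because each step grows all sequences at the same position, the per-step work is a dense batched computation, which is the kind of regularity the custom kernels described in the abstract can exploit.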
Related papers
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome the computational and memory obstacles of long-range Transformers.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
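As an illustration of the selection step described in the SPARSEK summary above, here is a NumPy sketch that uses a hard (non-differentiable) top-k in place of SPARSEK's differentiable top-k mask operator, and assumes the per-query scores from the scoring network are already given:

```python
import numpy as np

def topk_kv_attention(q, k, v, scores, kk):
    """q: (n_q, d); k, v: (n_kv, d); scores: (n_q, n_kv) from a scoring
    network. Each query keeps only its kk highest-scored KV pairs."""
    idx = np.argsort(scores, axis=-1)[:, -kk:]        # (n_q, kk) kept indices
    k_sel, v_sel = k[idx], v[idx]                     # (n_q, kk, d)
    logits = np.einsum('qd,qkd->qk', q, k_sel) / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # softmax over kept pairs only
    return np.einsum('qk,qkd->qd', w, v_sel)          # (n_q, d)
```

Keeping a constant number of KV pairs per query is what makes the attention cost linear rather than quadratic in sequence length.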
- SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs [3.302913401404089]
Sliding-window-based static sparse attention mitigates the cost of long inputs by limiting the attention scope of the input tokens.
We propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input.
arXiv Detail & Related papers (2024-05-27T10:25:08Z)
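Below is a small NumPy sketch of the static sparsity pattern that sliding-window attention relies on (the pattern only; SWAT's contribution is the FPGA dataflow that exploits it):

```python
import numpy as np

def sliding_window_mask(n, window):
    """Static banded mask: token i may attend to j iff |i - j| < window.
    (Causal variants additionally restrict to j <= i.)"""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.abs(i - j) < window   # (n, n) boolean mask
```

Because each token attends to at most 2*window - 1 neighbors, compute and memory grow linearly with input length instead of quadratically, which is what makes the design scalable for long inputs.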
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder for on-device use, and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
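As background for the decompositions HEAT searches over, here is a sketch of the simplest member of that family, a rank-r matrix factorization via truncated SVD (HEAT itself explores a much larger space of tensor decompositions and scores candidates by measured hardware cost, which this sketch does not model):

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Replace a dense w (m x n) with u_r @ vt_r of the given rank,
    cutting parameters from m*n to rank*(m + n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]      # fold singular values into u
    vt_r = vt[:rank, :]
    return u_r, vt_r                  # y = (x @ u_r) @ vt_r approximates x @ w
```

A hardware-aware framework then has to decide which layers to decompose and at what rank, trading accuracy against metrics like the energy-delay product rather than parameter count alone.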
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
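The N:M pattern itself is easy to state in code. The following NumPy sketch enforces it by magnitude pruning (assuming the weight count is divisible by m; the paper's IDP training procedure and STA hardware are of course not captured here):

```python
import numpy as np

def prune_n_of_m(w, n=2, m=4):
    """Enforce an N:M pattern along the last axis: in every group of m
    consecutive weights, keep the n largest-magnitude entries and zero
    the rest (the 2:4 case matches NVIDIA's sparse tensor cores)."""
    g = w.reshape(-1, m)
    keep = np.argsort(np.abs(g), axis=1)[:, -n:]    # indices to keep per group
    mask = np.zeros_like(g, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (g * mask).reshape(w.shape)
```

The regularity of the pattern (exactly n nonzeros per group of m) is what lets dedicated hardware skip the zeroed weights with little bookkeeping overhead.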
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline that adds no extra computation.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
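The abstract does not spell ALR out, but one plausible reading is a norm re-scaling of the predicted novel-class weights toward the scale of the pretrained base-class weights. The following NumPy sketch is that guess, not the paper's exact procedure:

```python
import numpy as np

def rescale_novel_weights(novel_w, base_w):
    """Rescale each predicted novel-class weight vector (rows of
    novel_w) to the average L2 norm of the pretrained base-class
    weights, so novel and base logits are comparable in scale."""
    target = np.linalg.norm(base_w, axis=1).mean()
    norms = np.linalg.norm(novel_w, axis=1, keepdims=True)
    return novel_w * (target / norms)
```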
- TurboTransformers: An Efficient GPU Serving System For Transformer Models [17.4637724940437]
The TurboTransformers system consists of a computing runtime and a serving framework.
An efficient parallel algorithm is proposed for GPU-based batch reduction operations.
A memory allocation algorithm is designed for variable-length input situations.
A serving framework equipped with a new batch scheduler achieves the optimal throughput on variable-length requests.
arXiv Detail & Related papers (2020-10-09T07:28:38Z)
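To see why a dedicated batch scheduler matters for variable-length requests, here is a toy greedy scheduler in plain Python (a stand-in only; TurboTransformers' actual scheduler and memory allocator are more sophisticated):

```python
def batch_by_length(requests, max_tokens):
    """Pack token-id lists into batches of similar length: sort by
    length, then grow a batch while padding every member to the
    batch's longest sequence stays within a token budget."""
    batches, cur = [], []
    for req in sorted(requests, key=len):
        trial = cur + [req]
        # cost of the padded batch = batch size * longest sequence in it
        if cur and len(trial) * max(len(r) for r in trial) > max_tokens:
            batches.append(cur)
            cur = [req]
        else:
            cur = trial
    if cur:
        batches.append(cur)
    return batches
```

Packing requests of similar lengths keeps each padded batch dense, so fewer FLOPs are wasted on pad tokens, which is the problem the system's batch scheduler and variable-length memory allocator address.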
- FTRANS: Energy-Efficient Acceleration of Transformers using FPGA [11.032972017827248]
We propose an efficient acceleration framework, Ftrans, for transformer-based large scale language representations.
Our framework significantly reduces the model size of NLP models by up to 16 times.
Our FPGA design achieves 27.07x and 81x improvement in performance and energy efficiency compared to CPU, and up to 8.80x improvement in energy efficiency compared to GPU.
arXiv Detail & Related papers (2020-07-16T18:58:31Z)
- An Efficient Accelerator Design Methodology for Deformable Convolutional Networks [16.392643034008348]
We present a novel approach to accelerate deformable convolution on FPGA.
By optimizing the receptive field, we can compress the maximum size of the receptive field by 12.6 times.
Our accelerator achieves up to 17.25 times speedup over the state-of-the-art accelerator.
arXiv Detail & Related papers (2020-06-09T13:16:44Z)