LightSeq: A High Performance Inference Library for Transformers
- URL: http://arxiv.org/abs/2010.13887v4
- Date: Thu, 22 Apr 2021 09:37:37 GMT
- Title: LightSeq: A High Performance Inference Library for Transformers
- Authors: Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li
- Abstract summary: LightSeq is a highly efficient inference library for Transformer models.
LightSeq includes a series of optimization techniques to streamline the neural layers and to reduce memory footprint.
- Score: 39.13192008249629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer, BERT and their variants have achieved great success in natural
language processing. Since Transformer models are huge in size, serving these
models is a challenge for real industrial applications. In this paper, we
propose LightSeq, a highly efficient inference library for models in the
Transformer family. LightSeq includes a series of GPU optimization techniques
to streamline the computation of neural layers and to reduce memory
footprint. LightSeq can easily import models trained using PyTorch and
TensorFlow. Experimental results on machine translation benchmarks show that
LightSeq achieves up to 14x speedup compared with TensorFlow and 1.4x compared
with FasterTransformer, a concurrent CUDA implementation. The code is available
at https://github.com/bytedance/lightseq.
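For concreteness, serving a model with LightSeq typically means exporting the trained checkpoint to a protobuf or HDF5 file and loading it through the library's Python wrapper. The snippet below is a minimal sketch assuming the `lightseq.inference.Transformer` class and an already exported `transformer.pb`, following the repository examples; exact arguments and file formats may differ between versions.

```python
# Minimal serving sketch (assumed API, following the LightSeq repository
# examples; the model path and token ids below are placeholders).
import lightseq.inference as lsi

# Load a Transformer model previously exported to protobuf.
# The second argument is the maximum batch size to reserve on the GPU.
model = lsi.Transformer("transformer.pb", 8)

# A batch of already-tokenized source sentences (ids from the training vocab).
src_tokens = [[63, 47, 65, 1507, 9, 2]]

# Run fused-kernel GPU decoding and get target token ids back.
outputs = model.infer(src_tokens)
print(outputs)
```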
Related papers
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models [0.0]
We present a Linear Clustering approach to exploit inherent parallel paths in ML dataflow graphs.
We generate readable and executable parallel PyTorch+Python code from input ML models in ONNX format.
Preliminary results on several ML graphs demonstrate up to 1.9x speedup over serial execution.
arXiv Detail & Related papers (2023-08-22T04:54:30Z)
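The idea behind the entry above (exploiting inherent parallel paths in an ML dataflow graph) can be illustrated with a generic level-scheduling pass over a toy DAG. This is a conceptual sketch, not the paper's Linear Clustering algorithm; the graph and node names are made up.

```python
# Toy illustration of extracting parallelizable groups from a dataflow DAG.
# Nodes in the same "level" have no dependencies on one another and could be
# dispatched concurrently (e.g. as separate CUDA streams or Python threads).
from collections import defaultdict

# edges: producer -> consumers (a made-up graph with two parallel branches)
edges = {
    "input":  ["conv_a", "conv_b"],
    "conv_a": ["relu_a"],
    "conv_b": ["relu_b"],
    "relu_a": ["concat"],
    "relu_b": ["concat"],
    "concat": [],
}

# Count unresolved dependencies per node.
indegree = defaultdict(int)
for src, dsts in edges.items():
    indegree.setdefault(src, 0)
    for dst in dsts:
        indegree[dst] += 1

# Kahn-style level scheduling: peel off all currently dependency-free nodes.
ready = [n for n, d in indegree.items() if d == 0]
levels = []
while ready:
    levels.append(ready)
    nxt = []
    for node in ready:
        for dst in edges[node]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                nxt.append(dst)
    ready = nxt

print(levels)
# [['input'], ['conv_a', 'conv_b'], ['relu_a', 'relu_b'], ['concat']]
# The middle two levels expose the 2-way parallelism in the graph.
```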
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
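For the INR-Arch entry above, the software baseline for nth-order gradients is simply repeated automatic differentiation, which re-traverses a growing graph at every order. The PyTorch sketch below shows that baseline on a toy scalar function; it illustrates the workload being accelerated, not the INR-Arch compiler itself.

```python
# Baseline nth-order gradient by repeatedly differentiating the graph.
# Each extra order differentiates an ever larger graph, which is the cost
# that dedicated dataflow architectures aim to reduce.
import torch

def nth_grad(f, x, n):
    """Return d^n f / dx^n at x by applying autograd n times."""
    y = f(x)
    for _ in range(n):
        (y,) = torch.autograd.grad(y, x, create_graph=True)
    return y

x = torch.tensor(0.5, requires_grad=True)
f = lambda t: torch.sin(3.0 * t)   # toy stand-in for an implicit neural representation

print(nth_grad(f, x, 1))   # 3 * cos(3x)
print(nth_grad(f, x, 3))   # -27 * cos(3x)
```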
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- Fast Inference from Transformers via Speculative Decoding [3.950600027250452]
Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model.
In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel.
arXiv Detail & Related papers (2022-11-30T17:33:28Z)
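To make the speculative decoding entry above concrete, here is a greedy-verification variant as a sketch: a cheap draft model proposes k tokens, the target model scores all of them in one parallel pass, and the longest agreeing prefix is kept. The `draft_model` and `target_model` callables are hypothetical, and the paper's actual rejection-sampling rule (which preserves the target distribution even when sampling) is more involved.

```python
# Greedy speculative decoding sketch. Assumes two hypothetical callables that
# map a non-empty token prefix to per-position next-token probability rows:
#   draft_model(prefix)  -> cheap model, called serially to propose k tokens
#   target_model(prefix) -> large model, one parallel pass over all positions
import numpy as np

def speculative_step(prefix, draft_model, target_model, k=4):
    # 1) Draft k tokens serially with the cheap model (greedy).
    drafted = list(prefix)
    for _ in range(k):
        drafted.append(int(np.argmax(draft_model(drafted)[-1])))

    # 2) One parallel pass of the big model over prefix + drafted tokens gives
    #    its greedy choice at every drafted position (row i predicts token i+1).
    target_probs = target_model(drafted)        # shape: (len(drafted), vocab)

    accepted = list(prefix)
    for pos in range(len(prefix), len(drafted)):
        target_choice = int(np.argmax(target_probs[pos - 1]))
        if target_choice == drafted[pos]:
            accepted.append(target_choice)      # draft agreed: token is free
        else:
            accepted.append(target_choice)      # first disagreement: keep the
            break                               # target's token and stop
    return accepted                             # >= 1 new token per big-model call
```

Each call costs one full-model forward pass but can emit up to k tokens when the draft model agrees, which is where the speedup comes from.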
- LightSeq: Accelerated Training for Transformer-based Models on GPUs [19.02791119065971]
LightSeq is a system for efficient training of Transformer-based models on GPUs.
It supports a variety of network architectures, including BERT (encoder-only), GPT (decoder-only), and Transformer (encoder-decoder).
arXiv Detail & Related papers (2021-10-12T03:17:03Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
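The FFT claim in the kernelized-attention entry above rests on a standard fact: a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the matrix in a circulant one. The numpy sketch below verifies the trick on random data; it shows only this mechanism, not the full kernelized-attention algorithm.

```python
# Toeplitz matrix-vector product in O(n log n) via circulant embedding + FFT.
# Relative-positional-encoding matrices are Toeplitz (each entry depends only
# on i - j), which is what makes this trick applicable.
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the Toeplitz matrix with first column c and first row r by x."""
    n = len(x)
    # First column of a 2n x 2n circulant matrix whose top-left block is T.
    v = np.concatenate([c, [0.0], r[:0:-1]])          # length 2n
    y = np.concatenate([x, np.zeros(n)])              # zero-pad x to length 2n
    full = np.fft.ifft(np.fft.fft(v) * np.fft.fft(y)) # circulant matvec via FFT
    return full[:n].real                              # top block equals T @ x

# Check against the explicit O(n^2) product.
n = 6
c = np.random.randn(n)                 # first column of T
r = np.random.randn(n); r[0] = c[0]    # first row (shares the corner entry)
x = np.random.randn(n)
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
print(np.allclose(T @ x, toeplitz_matvec_fft(c, r, x)))   # True
```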
- FastSeq: Make Sequence Generation Faster [20.920579109726024]
We develop the FastSeq framework to accelerate sequence generation without accuracy loss.
Benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain.
FastSeq is easy to use with a simple one-line code change.
arXiv Detail & Related papers (2021-06-08T22:25:28Z)
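The "one-line code change" in the FastSeq entry above refers to importing FastSeq before the generation toolkit so that its optimized generation code is patched in. The sketch below is a hedged illustration under that assumption; the import side effect and supported versions are defined by the FastSeq project, while the transformers calls are standard.

```python
# Hedged sketch of the advertised one-line change: importing fastseq first
# is assumed to patch its faster generation code into the toolkit imported
# afterwards (treat this behaviour, and the model choice, as assumptions).
import fastseq                      # the "one line" -- must come before the toolkit
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

batch = tok(["LightSeq and FastSeq both target faster sequence generation."],
            return_tensors="pt")
summary_ids = model.generate(batch["input_ids"], num_beams=4, max_length=50)
print(tok.batch_decode(summary_ids, skip_special_tokens=True))
```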
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
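The FNet mixing sublayer described above has a very short concrete form: replace self-attention with a 2D discrete Fourier transform over the sequence and hidden dimensions and keep the real part. The PyTorch sketch below renders just that mixing step; the feed-forward and LayerNorm parts of the encoder block are omitted.

```python
# Minimal FNet-style mixing sublayer: a parameter-free 2D FFT over the
# sequence and hidden dimensions, keeping only the real part. This is the
# operation that replaces the self-attention sublayer.
import torch

def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, hidden) -> same shape, tokens mixed by FFT."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 16, 64)          # toy batch
print(fourier_mix(x).shape)         # torch.Size([2, 16, 64])
```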
- Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translations with an 8-15x speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
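The compression step in the Funnel-Transformer entry above amounts to pooling the hidden-state sequence to a shorter length between encoder blocks. The sketch below shows strided mean pooling (window 2, the paper's default) as a minimal illustration; the full model also includes a decoder to recover token-level outputs.

```python
# Strided mean pooling over the sequence dimension: the basic operation used
# to shorten the hidden-state sequence (and hence reduce FLOPs) between
# encoder blocks. The tensor sizes here are arbitrary toy values.
import torch
import torch.nn.functional as F

def compress_sequence(h: torch.Tensor, pool_size: int = 2) -> torch.Tensor:
    """h: (batch, seq_len, hidden) -> (batch, seq_len // pool_size, hidden)."""
    # avg_pool1d pools over the last dim, so move the sequence axis there.
    return F.avg_pool1d(h.transpose(1, 2), kernel_size=pool_size).transpose(1, 2)

h = torch.randn(2, 128, 768)
print(compress_sequence(h).shape)   # torch.Size([2, 64, 768])
```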
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.