FastSeq: Make Sequence Generation Faster
- URL: http://arxiv.org/abs/2106.04718v1
- Date: Tue, 8 Jun 2021 22:25:28 GMT
- Title: FastSeq: Make Sequence Generation Faster
- Authors: Yu Yan, Fei Hu, Jiusheng Chen, Nikhil Bhendawade, Ting Ye, Yeyun Gong,
Nan Duan, Desheng Cui, Bingyu Chi and Ruifei Zhang
- Abstract summary: We develop the FastSeq framework to accelerate sequence generation without accuracy loss.
Benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain.
FastSeq is easy to use with a simple one-line code change.
- Score: 20.920579109726024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have made tremendous impacts in natural language
generation. However, inference speed is a bottleneck due to the large model size
and the intensive computation involved in the auto-regressive decoding process. We
develop the FastSeq framework to accelerate sequence generation without accuracy
loss. The proposed optimization techniques include an attention cache
optimization, an efficient algorithm for detecting repeated n-grams, and an
asynchronous generation pipeline with parallel I/O. These optimizations are
general enough to be applicable to Transformer-based models (e.g., T5, GPT2,
and UniLM). Our benchmark results on a set of widely used and diverse models
demonstrate a 4-9x inference speed gain. Additionally, FastSeq is easy to use
with a simple one-line code change. The source code is available at
https://github.com/microsoft/fastseq.
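The one-line code change mentioned above is described in the project repository as importing fastseq before the backend library, so that its generation code paths are patched at import time. The snippet below is a minimal sketch of that usage around an ordinary Hugging Face BART summarization call; the model name and input text are placeholders, and the exact patching behavior should be checked against the fastseq README for the installed version.

import fastseq  # the one-line change (assumed to patch the backend on import)
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "FastSeq accelerates auto-regressive sequence generation ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True)

# The generation code itself is unchanged; any speedup comes from the patched internals.
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))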
Related papers
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z) - Planning with Large Language Models for Code Generation [100.07232672883897]
Planning-Guided Transformer Decoding (PG-TD) uses a planning algorithm to do lookahead search and guide the Transformer to generate better programs.
We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks.
arXiv Detail & Related papers (2023-03-09T18:59:47Z) - Fast Inference from Transformers via Speculative Decoding [3.950600027250452]
Inference from large autoregressive models like Transformers is slow: decoding K tokens takes K serial runs of the model.
In this work we introduce speculative decoding, an algorithm to sample from autoregressive models faster, without any changes to the outputs, by computing several tokens in parallel (a toy sketch of the draft-and-verify idea appears after this list).
arXiv Detail & Related papers (2022-11-30T17:33:28Z) - Fast DistilBERT on CPUs [13.29188219884869]
Transformer-based language models have become the standard approach to solving natural language processing tasks.
Industry adoption usually requires the maximum throughput to comply with certain latency constraints.
We propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators.
arXiv Detail & Related papers (2022-10-27T07:22:50Z) - A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z) - Cascaded Fast and Slow Models for Efficient Semantic Code Search [46.53530668938728]
We propose an efficient and accurate semantic code search framework with cascaded fast and slow models.
The proposed cascaded approach is not only efficient and scalable, but also achieves state-of-the-art results.
arXiv Detail & Related papers (2021-10-15T02:23:35Z) - Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity (a minimal sketch of the additive pooling step appears at the end of this list).
arXiv Detail & Related papers (2021-08-20T09:44:44Z) - Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding [57.08875260900373]
We propose Shallow Aggressive Decoding (SAD) to improve the online inference efficiency of the Transformer for instantaneous Grammatical Error Correction (GEC).
SAD aggressively decodes as many tokens as possible in parallel instead of always decoding only one token in each step to improve computational parallelism.
Experiments in both English and Chinese GEC benchmarks show that aggressive decoding could yield the same predictions but with a significant speedup for online inference.
arXiv Detail & Related papers (2021-06-09T10:30:59Z) - Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, both reduces overall training time and improves perplexity.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z) - LightSeq: A High Performance Inference Library for Transformers [39.13192008249629]
LightSeq is a highly efficient inference library for Transformer models.
LightSeq includes a series of optimization techniques to streamline the neural layers and to reduce memory footprint.
arXiv Detail & Related papers (2020-10-23T13:45:26Z) - FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA [27.50143717931293]
WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution.
We develop the first accelerator platform, FastWave, for autoregressive convolutional neural networks.
arXiv Detail & Related papers (2020-02-09T06:15:09Z)
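As a companion to the speculative decoding entry above, the following is a toy, greedy-only sketch of the draft-and-verify idea: a cheap draft model proposes a few tokens, the expensive target model checks them, and the longest agreeing prefix is kept, plus the target's correction at the first mismatch. The two models here are stand-in functions invented for illustration; the paper's actual algorithm also handles sampling via a rejection scheme and runs the verification as a single batched forward pass.

# Toy sketch of greedy speculative decoding (illustrative only).

def draft_next(tokens):
    """Stand-in cheap draft model: a deterministic toy rule."""
    return (tokens[-1] + 1) % 50

def target_next(tokens):
    """Stand-in expensive target model: a slightly different toy rule."""
    return (tokens[-1] + 1) % 50 if tokens[-1] % 7 else (tokens[-1] + 2) % 50

def speculative_decode(prompt, max_new_tokens=16, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        # 1) Draft k tokens cheaply, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify: the target model checks each drafted position.
        #    (In a real system this is one batched forward pass.)
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's token replaces the mismatch
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]

print(speculative_decode([3], max_new_tokens=10, k=4))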
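For the Fastformer entry above, this is a minimal sketch of the additive attention pooling step behind the linear complexity claim: each token gets a scalar score from a learned vector, a softmax over positions turns the scores into weights, and the weighted sum is a single global context vector. Fastformer itself stacks such poolings for queries and keys with element-wise interactions, which is omitted here; the array shapes and the scoring vector are illustrative assumptions.

import numpy as np

def additive_global_pool(x, w):
    # x: (seq_len, d) token representations; w: (d,) learned scoring vector.
    scores = x @ w                        # one scalar score per token, O(n*d)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over sequence positions
    return alpha @ x                      # global context vector, shape (d,)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))          # 6 tokens, hidden size 8 (toy values)
scoring_vector = rng.normal(size=8)
global_ctx = additive_global_pool(tokens, scoring_vector)
print(global_ctx.shape)                   # (8,)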