Flover: A Temporal Fusion Framework for Efficient Autoregressive Model
Parallel Inference
- URL: http://arxiv.org/abs/2305.13484v3
- Date: Fri, 3 Nov 2023 03:37:20 GMT
- Title: Flover: A Temporal Fusion Framework for Efficient Autoregressive Model
Parallel Inference
- Authors: Jinghan Yao, Nawras Alnaasan, Tian Chen, Aamir Shafi, Hari Subramoni,
Dhabaleswar K. (DK) Panda
- Abstract summary: Inference on autoregressive models harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens.
We propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel.
By orchestrating token-level parallelism, Flover achieves optimal hardware efficiency and significantly spares system resources.
- Score: 3.005912820808423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive models, despite their commendable performance in a myriad of
generative tasks, face challenges stemming from their inherently sequential
structure. Inference on these models, by design, harnesses a temporal
dependency, where the current token's probability distribution is conditioned
on preceding tokens. This inherent characteristic severely impedes
computational efficiency during inference: a typical request can require
thousands of tokens, and generating each token requires loading the entire
model weights, making inference memory-bound. This overhead becomes even more
pronounced in real deployments, where requests arrive randomly and require
varying generation lengths. Existing solutions, such as dynamic
batching and concurrent instances, introduce significant response delays and
bandwidth contention, falling short of achieving optimal latency and
throughput. To address these shortcomings, we propose Flover -- a temporal
fusion framework for efficiently inferring multiple requests in parallel. We
deconstruct the general generation pipeline into pre-processing and token
generation, and equip the framework with a dedicated work scheduler for fusing
the generation process temporally across all requests. By orchestrating
token-level parallelism, Flover achieves optimal hardware efficiency and
significantly spares system resources. By further employing a fast buffer
reordering algorithm that allows memory eviction of finished tasks, it brings
over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge
solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the
advanced tensor parallel technique, Flover proves efficacious across diverse
computational landscapes, from single-GPU setups to distributed scenarios,
thereby offering robust performance optimization that adapts to variable use
cases.
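To make the memory-bound argument concrete (an illustrative back-of-the-envelope estimate, not a figure from the paper): a 7B-parameter model in FP16 occupies roughly 14 GB, and if every generated token must stream all of those weights from HBM at about 2 TB/s, a single request is capped near 2000/14, i.e. roughly 140 tokens per second, regardless of compute throughput. Fusing many requests into one batched decode step amortizes each weight load across all of them.

The sketch below illustrates how such token-level temporal fusion could be organized: a work scheduler admits newly arrived requests at token boundaries, every iteration performs one fused decode step over all in-flight requests, and finished requests are evicted so the batch stays compact. This is a minimal, framework-agnostic Python sketch under assumed names (Request, decode_step, temporal_fusion_loop, and a toy random "model"); it is not Flover's actual scheduler or buffer-reordering implementation.

```python
# Minimal sketch of token-level temporal fusion as described in the abstract:
# requests arriving at different times are fused so that every decode iteration
# runs one batched step over all in-flight requests, and finished requests are
# evicted by compacting the batch. All names here are illustrative assumptions.
import random
from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: int
    prompt: list              # pre-processed (prefill) tokens
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(last_tokens):
    """Stand-in for one fused forward pass: a single batched model call produces
    the next token for every active request (toy: random tokens, 0 = end)."""
    return [random.randint(0, 50) for _ in last_tokens]

def temporal_fusion_loop(incoming, max_steps=64):
    """incoming: list of (arrival_step, Request) sorted by arrival time."""
    active, finished = [], []
    for step in range(max_steps):
        # Work scheduler: admit newly arrived requests at a token boundary,
        # so they join the fused batch without restarting anyone else's work.
        while incoming and incoming[0][0] <= step:
            _, req = incoming.pop(0)
            active.append(req)
        if not active and not incoming:
            break
        if not active:
            continue

        # Token-level fusion: one batched decode step serves all requests.
        last_tokens = [r.generated[-1] if r.generated else r.prompt[-1] for r in active]
        for req, tok in zip(active, decode_step(last_tokens)):
            req.generated.append(tok)

        # Buffer reordering / eviction: drop finished requests and compact the
        # batch so later iterations pay no cost for completed work.
        still_active = []
        for req in active:
            done = req.generated[-1] == 0 or len(req.generated) >= req.max_new_tokens
            (finished if done else still_active).append(req)
        active = still_active
    return finished + active

if __name__ == "__main__":
    random.seed(0)
    # (arrival_step, request) pairs: requests arrive randomly with varying lengths.
    workload = [(0, Request(0, [1, 2, 3], 8)),
                (2, Request(1, [4, 5], 16)),
                (5, Request(2, [6], 4))]
    for r in temporal_fusion_loop(workload):
        print(f"request {r.req_id}: {len(r.generated)} tokens generated")
```

The property mirrored here is that admission and eviction happen only at token boundaries, so a request joining or leaving never stalls the fused batch of the others.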
Related papers
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference [8.527031391688283]
Kraken is an evolution of the standard Transformer architecture for efficient inference on multi-device systems.
When trained on OpenWebText, Kraken models reach a perplexity similar to that of standard Transformers.
When tested on the SuperGLUE benchmark, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes.
arXiv Detail & Related papers (2024-08-14T20:24:03Z)
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation [9.080650575731152]
PipeInfer is a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios.
PipeInfer exhibits up to a 2.15× improvement in generation speed over standard speculative inference.
arXiv Detail & Related papers (2024-07-16T14:52:02Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
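(A simplified, hedged sketch of this per-query top-k KV selection idea appears after this list.)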
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters [5.190794062263327]
Training large language models (LLMs) is known to be challenging because of the huge computational and memory capacity requirements.
We propose Pipette, an automatic fine-grained LLM training configurator for real-world clusters.
arXiv Detail & Related papers (2024-05-28T11:59:44Z)
- Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations [92.1840862558718]
In practical distributed systems, workers are typically not homogeneous and can have highly varying processing times.
We introduce a new parallel method, Freya, to handle arbitrarily slow computations.
We show that Freya offers significantly improved complexity guarantees compared to all previous methods.
arXiv Detail & Related papers (2024-05-24T13:33:30Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as some of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
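
As referenced in the SPARSEK Attention entry above, the following is a simplified illustration of selecting a constant number of key/value pairs per query. It uses a plain hard top-k over a toy scoring vector (NumPy) rather than SPARSEK's learned scoring network and differentiable top-k mask operator; it is an assumption-laden sketch of the general mechanism, not the paper's algorithm.

```python
# Simplified per-query top-k KV selection: each query attends to only k of the
# n cached key/value pairs, chosen by a scoring function. Hard top-k stands in
# for SPARSEK's differentiable mask; the scoring vector stands in for its
# learned scoring network.
import numpy as np

def topk_sparse_attention(q, K, V, score_w, k=4):
    """q: (d,) single query; K, V: (n, d) cached keys/values; score_w: (d,) toy scorer."""
    scores = K @ score_w                        # stand-in for a learned scoring network
    keep = np.argsort(scores)[-k:]              # hard top-k instead of a differentiable mask
    K_sel, V_sel = K[keep], V[keep]             # constant-size KV set per query
    logits = (K_sel @ q) / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V_sel                      # attention output over the selected pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 16, 128                              # long cache, but only k pairs are attended to
    out = topk_sparse_attention(rng.normal(size=d), rng.normal(size=(n, d)),
                                rng.normal(size=(n, d)), rng.normal(size=d), k=4)
    print(out.shape)                            # (16,)
```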