Flover: A Temporal Fusion Framework for Efficient Autoregressive Model
Parallel Inference
- URL: http://arxiv.org/abs/2305.13484v3
- Date: Fri, 3 Nov 2023 03:37:20 GMT
- Title: Flover: A Temporal Fusion Framework for Efficient Autoregressive Model
Parallel Inference
- Authors: Jinghan Yao, Nawras Alnaasan, Tian Chen, Aamir Shafi, Hari Subramoni,
Dhabaleswar K. (DK) Panda
- Abstract summary: Inference on autoregressive models harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens.
We propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel.
By orchestrating token-level parallelism, Flover achieves optimal hardware efficiency and significantly spares system resources.
- Score: 3.005912820808423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive models, despite their commendable performance in a myriad of
generative tasks, face challenges stemming from their inherently sequential
structure. Inference on these models, by design, harnesses a temporal
dependency, where the current token's probability distribution is conditioned
on preceding tokens. This inherent characteristic severely impedes
computational efficiency during inference: a typical request can require
thousands of tokens, and generating each token requires loading the entire
model weights, making inference memory-bound. This overhead becomes even more
pronounced in real deployments, where requests arrive randomly and require
varying generation lengths. Existing solutions, such as dynamic
batching and concurrent instances, introduce significant response delays and
bandwidth contention, falling short of achieving optimal latency and
throughput. To address these shortcomings, we propose Flover -- a temporal
fusion framework for efficiently inferring multiple requests in parallel. We
deconstruct the general generation pipeline into pre-processing and token
generation, and equip the framework with a dedicated work scheduler for fusing
the generation process temporally across all requests. By orchestrating
token-level parallelism, Flover achieves optimal hardware efficiency and
significantly spares system resources. By further employing a fast buffer
reordering algorithm that allows memory eviction of finished tasks, it brings
over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge
solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the
advanced tensor parallel technique, Flover proves efficacious across diverse
computational landscapes, from single-GPU setups to distributed scenarios,
thereby offering robust performance optimization that adapts to variable use
cases.
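To make the memory-bound argument concrete (an illustrative back-of-the-envelope estimate, not a figure from the paper): a 7B-parameter model in FP16 occupies roughly 14 GB, and if every generated token must stream all of those weights from HBM at about 2 TB/s, a single request is capped near 2000/14, i.e. roughly 140 tokens per second, regardless of compute throughput. Fusing many requests into one batched decode step amortizes each weight load across all of them.

The sketch below illustrates how such token-level temporal fusion could be organized: a work scheduler admits newly arrived requests at token boundaries, every iteration performs one fused decode step over all in-flight requests, and finished requests are evicted so the batch stays compact. This is a minimal, framework-agnostic Python sketch under assumed names (Request, decode_step, temporal_fusion_loop, and a toy random "model"); it is not Flover's actual scheduler or buffer-reordering implementation.

```python
# Minimal sketch of token-level temporal fusion as described in the abstract:
# requests arriving at different times are fused so that every decode iteration
# runs one batched step over all in-flight requests, and finished requests are
# evicted by compacting the batch. All names here are illustrative assumptions.
import random
from dataclasses import dataclass, field

@dataclass
class Request:
    req_id: int
    prompt: list              # pre-processed (prefill) tokens
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(last_tokens):
    """Stand-in for one fused forward pass: a single batched model call produces
    the next token for every active request (toy: random tokens, 0 = end)."""
    return [random.randint(0, 50) for _ in last_tokens]

def temporal_fusion_loop(incoming, max_steps=64):
    """incoming: list of (arrival_step, Request) sorted by arrival time."""
    active, finished = [], []
    for step in range(max_steps):
        # Work scheduler: admit newly arrived requests at a token boundary,
        # so they join the fused batch without restarting anyone else's work.
        while incoming and incoming[0][0] <= step:
            _, req = incoming.pop(0)
            active.append(req)
        if not active and not incoming:
            break
        if not active:
            continue

        # Token-level fusion: one batched decode step serves all requests.
        last_tokens = [r.generated[-1] if r.generated else r.prompt[-1] for r in active]
        for req, tok in zip(active, decode_step(last_tokens)):
            req.generated.append(tok)

        # Buffer reordering / eviction: drop finished requests and compact the
        # batch so later iterations pay no cost for completed work.
        still_active = []
        for req in active:
            done = req.generated[-1] == 0 or len(req.generated) >= req.max_new_tokens
            (finished if done else still_active).append(req)
        active = still_active
    return finished + active

if __name__ == "__main__":
    random.seed(0)
    # (arrival_step, request) pairs: requests arrive randomly with varying lengths.
    workload = [(0, Request(0, [1, 2, 3], 8)),
                (2, Request(1, [4, 5], 16)),
                (5, Request(2, [6], 4))]
    for r in temporal_fusion_loop(workload):
        print(f"request {r.req_id}: {len(r.generated)} tokens generated")
```

The property mirrored here is that admission and eviction happen only at token boundaries, so a request joining or leaving never stalls the fused batch of the others.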
Related papers
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference [8.527031391688283]
Kraken is an evolution of the standard Transformer architecture for efficient inference on multi-device systems.
When trained on OpenWebText, Kraken models reach a perplexity similar to that of standard Transformers.
When tested on the SuperGLUE benchmark, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes.
arXiv Detail & Related papers (2024-08-14T20:24:03Z)
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation [9.080650575731152]
PipeInfer is a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios.
PipeInfer exhibits up to a 2.15× improvement in generation speed over standard speculative inference.
arXiv Detail & Related papers (2024-07-16T14:52:02Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
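(A simplified, hedged sketch of this per-query top-k KV selection idea appears after this list.)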
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters [5.190794062263327]
Training large language models (LLMs) is known to be challenging because of the huge computational and memory capacity requirements.
We propose Pipette, an automatic fine-grained LLM training configurator for real-world clusters.
arXiv Detail & Related papers (2024-05-28T11:59:44Z)
- Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations [92.1840862558718]
In practical distributed systems, workers are typically not homogeneous and can have highly varying processing times.
We introduce a new parallel method, Freya, to handle arbitrarily slow computations.
We show that Freya offers significantly improved complexity guarantees compared to all previous methods.
arXiv Detail & Related papers (2024-05-24T13:33:30Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as some of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
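
As referenced in the SPARSEK Attention entry above, the following is a simplified illustration of selecting a constant number of key/value pairs per query. It uses a plain hard top-k over a toy scoring vector (NumPy) rather than SPARSEK's learned scoring network and differentiable top-k mask operator; it is an assumption-laden sketch of the general mechanism, not the paper's algorithm.

```python
# Simplified per-query top-k KV selection: each query attends to only k of the
# n cached key/value pairs, chosen by a scoring function. Hard top-k stands in
# for SPARSEK's differentiable mask; the scoring vector stands in for its
# learned scoring network.
import numpy as np

def topk_sparse_attention(q, K, V, score_w, k=4):
    """q: (d,) single query; K, V: (n, d) cached keys/values; score_w: (d,) toy scorer."""
    scores = K @ score_w                        # stand-in for a learned scoring network
    keep = np.argsort(scores)[-k:]              # hard top-k instead of a differentiable mask
    K_sel, V_sel = K[keep], V[keep]             # constant-size KV set per query
    logits = (K_sel @ q) / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V_sel                      # attention output over the selected pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 16, 128                              # long cache, but only k pairs are attended to
    out = topk_sparse_attention(rng.normal(size=d), rng.normal(size=(n, d)),
                                rng.normal(size=(n, d)), rng.normal(size=d), k=4)
    print(out.shape)                            # (16,)
```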