DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and
DeepSpeed-Inference
- URL: http://arxiv.org/abs/2401.08671v1
- Date: Tue, 9 Jan 2024 06:49:40 GMT
- Title: DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and
DeepSpeed-Inference
- Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff
Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash
Bakhtiari, Lev Kurilenko, Yuxiong He
- Abstract summary: This paper introduces DeepSpeed-FastGen, a system that delivers up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM.
We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for large language models.
- Score: 23.49242865222089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deployment and scaling of large language models (LLMs) have become
critical as they permeate various applications, demanding high-throughput and
low-latency serving systems. Existing frameworks struggle to balance these
requirements, especially for workloads with long prompts. This paper introduces
DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and
generation composition strategy, to deliver up to 2.3x higher effective
throughput, 2x lower latency on average, and up to 3.7x lower (token-level)
tail latency, compared to state-of-the-art systems like vLLM. We leverage a
synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an
efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced
implementation supports a range of models and offers both non-persistent and
persistent deployment options, catering to diverse user scenarios from
interactive sessions to long-running applications. We present a detailed
benchmarking methodology, analyze the performance through latency-throughput
curves, and investigate scalability via load balancing. Our evaluations
demonstrate substantial improvements in throughput and latency across various
models and hardware configurations. We discuss our roadmap for future
enhancements, including broader model support and new hardware backends. The
DeepSpeed-FastGen code is readily available for community engagement and
contribution.
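The abstract names Dynamic SplitFuse only as a "prompt and generation composition strategy" and does not spell out the algorithm. As a rough illustration of the general idea (long prompts are split into chunks and mixed with ongoing generation so that each forward pass stays within a fixed token budget), here is a toy scheduler. It is a minimal sketch, not the DeepSpeed-FastGen implementation: the names `Request`, `schedule`, and `token_budget` are hypothetical, and the simplification that generation starts only in the pass after the prompt finishes is an assumption made for brevity.

```python
from collections import deque
from dataclasses import dataclass, replace


@dataclass
class Request:
    """Hypothetical request record used only for this illustration."""
    name: str
    prompt_tokens: int   # prompt tokens not yet processed
    decode_tokens: int   # tokens still to be generated


def schedule(requests, token_budget):
    """Plan forward passes; each pass is a list of (name, kind, n_tokens)."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    pending = deque(replace(r) for r in requests)  # copies; inputs stay untouched
    decoding = deque()                             # requests in the generation phase
    passes = []
    while pending or decoding:
        batch, used = [], 0
        still_decoding = deque()
        # 1) Each generating request contributes one token, budget permitting.
        while decoding:
            req = decoding.popleft()
            if used < token_budget:
                batch.append((req.name, "decode", 1))
                req.decode_tokens -= 1
                used += 1
            if req.decode_tokens > 0:
                still_decoding.append(req)
        # 2) Fill the remaining budget with prompt chunks, splitting long prompts.
        while pending and used < token_budget:
            req = pending[0]
            chunk = min(req.prompt_tokens, token_budget - used)
            if chunk > 0:
                batch.append((req.name, "prompt", chunk))
                req.prompt_tokens -= chunk
                used += chunk
            if req.prompt_tokens == 0:
                pending.popleft()
                if req.decode_tokens > 0:
                    still_decoding.append(req)  # starts generating next pass
        decoding = still_decoding
        passes.append(batch)
    return passes


if __name__ == "__main__":
    plan = schedule([Request("A", 10, 2), Request("B", 3, 1)], token_budget=8)
    for i, batch in enumerate(plan):
        print(i, batch)
```

In this sketch a 10-token prompt with a budget of 8 is processed as an 8-token chunk followed by a 2-token chunk, and the second chunk shares its pass with another request's short prompt. Keeping every pass near the same size is the property this kind of composition is meant to provide.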
Related papers
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z)
- FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction [3.6640504352010885]
This paper introduces FLAASH, a flexible and modular accelerator design for sparse tensor contraction.
Our architecture performs sparse high-order tensor contraction by distributing sparse dot products, or portions thereof, to numerous Sparse Dot Product Engines.
The effectiveness of our approach is demonstrated through various evaluations, showcasing significant speedup as sparsity and order increase.
arXiv Detail & Related papers (2024-04-25T03:46:53Z)
- MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems [27.490645446510033]
We propose a novel mapping framework that can perform computation-aware accelerator selection and apply communication-aware sharding strategies to maximize parallelism.
We show that MARS can achieve 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method.
arXiv Detail & Related papers (2023-07-23T05:50:37Z)
- A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models [6.823233135936128]
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc.
To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data.
Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale.
arXiv Detail & Related papers (2022-10-17T07:36:18Z)
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"Smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime [57.5143536744084]
High performance of deep learning models comes at the expense of high computational, storage and power requirements.
We introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms.
arXiv Detail & Related papers (2022-07-18T15:05:17Z)
- An Intelligent Deterministic Scheduling Method for Ultra-Low Latency Communication in Edge Enabled Industrial Internet of Things [19.277349546331557]
Time Sensitive Networking (TSN) has recently been studied as a way to realize low-latency communication via deterministic scheduling.
A deterministic scheduling method based on non-collision theory (NDS) is proposed to achieve ultra-low latency communication for time-sensitive flows.
Experimental results demonstrate that NDS/DQS can support deterministic ultra-low latency services and guarantee efficient bandwidth utilization.
arXiv Detail & Related papers (2022-07-17T16:52:51Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- Towards High Performance Java-based Deep Learning Frameworks [0.22940141855172028]
Modern cloud services have set the demand for fast and efficient data processing.
This demand is common among numerous application domains, such as deep learning, data mining, and computer vision.
In this paper, we employ TornadoVM, a state-of-the-art programming framework, to transparently accelerate Deep Netts, a Java-based deep learning framework.
arXiv Detail & Related papers (2020-01-13T13:03:13Z)