DeepSpeed Inference: Enabling Efficient Inference of Transformer Models
  at Unprecedented Scale
- URL: http://arxiv.org/abs/2207.00032v1
- Date: Thu, 30 Jun 2022 18:01:08 GMT
- Title: DeepSpeed Inference: Enabling Efficient Inference of Transformer Models
  at Unprecedented Scale
- Authors: Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad
  Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji
  Ruwase, Yuxiong He
- Abstract summary: DeepSpeed Inference is a comprehensive system solution for transformer model inference.
It reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.
It can run inference on models 25x larger than GPU-only solutions can handle, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
- Score: 20.558091867632445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   The past several years have witnessed the success of transformer-based
models, and their scale and application scenarios continue to grow
aggressively. The current landscape of transformer models is increasingly
diverse: the model size varies drastically with the largest being of
hundred-billion parameters; the model characteristics differ due to the
sparsity introduced by the Mixture-of-Experts; the target application scenarios
can be latency-critical or throughput-oriented; the deployment hardware could
be single- or multi-GPU systems with different types of memory and storage,
etc. With such increasing diversity and the fast-evolving pace of transformer
models, designing a highly performant and efficient inference system is
extremely challenging. In this paper, we present DeepSpeed Inference, a
comprehensive system solution for transformer model inference to address the
above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU
inference solution to minimize latency while maximizing the throughput of both
dense and sparse transformer models when they fit in aggregate GPU memory, and
(2) a heterogeneous inference solution that leverages CPU and NVMe memory in
addition to the GPU memory and compute to enable high inference throughput with
large models which do not fit in aggregate GPU memory. DeepSpeed Inference
reduces latency by up to 7.3X over the state-of-the-art for latency-oriented
scenarios and increases throughput by over 1.5x for throughput-oriented
scenarios. Moreover, it enables trillion parameter scale inference under
real-time latency constraints by leveraging hundreds of GPUs, an unprecedented
scale for inference. It can inference 25x larger models than with GPU-only
solutions, while delivering a high throughput of 84 TFLOPS (over $50\%$ of
A6000 peak).
 
      
Related papers
- MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism [26.923312725688735]
 Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity.
We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models.
 arXiv  Detail & Related papers  (2025-04-03T04:20:44Z)
- Ultra-Sparse Memory Network [8.927205198458994]
 This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address these limitations.
We show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.
 arXiv  Detail & Related papers  (2024-11-19T09:24:34Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
 MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
 arXiv  Detail & Related papers  (2024-11-18T01:06:12Z)
- Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices [19.96064012736243]
 This paper introduces PIPELOAD, a memory-efficient pipeline execution mechanism.
It reduces memory usage by incorporating dynamic memory management and minimizes inference latency.
We present Hermes, a framework optimized for large model inference on edge devices.
 arXiv  Detail & Related papers  (2024-09-06T12:55:49Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
 We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
 Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-r, alongside prior cache-based methods at the same inference speed.
 arXiv  Detail & Related papers  (2024-06-03T18:49:57Z)
- DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
 We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the Diffusion Transformers (DiT) design.
In addition to better performance than DiT, DiG-S/2 exhibits $2.5\times$ higher training speed than DiT-S/2 and saves $75.7\%$ memory at a resolution of $1792 \times 1792$.
With the same model size, DiG-XL/2 is $4.2\times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8\times$ faster than DiT with FlashAttention-2.
 arXiv  Detail & Related papers  (2024-05-28T17:59:33Z)
- AI and Memory Wall [81.06494558184049]
 We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
 arXiv  Detail & Related papers  (2024-03-21T04:31:59Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
 The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
 arXiv  Detail & Related papers  (2023-06-13T08:57:54Z)
- Efficiently Scaling Transformer Inference [8.196193683641582]
 We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
 arXiv  Detail & Related papers  (2022-11-09T18:50:38Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
 This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves the global receptive field and multi-scale learning.
 EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
 arXiv  Detail & Related papers  (2022-05-29T20:07:23Z)
- LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models [34.673688610935876]
 We show that the latency and perplexity Pareto frontier can be found without the need for any model training.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency.
 arXiv  Detail & Related papers  (2022-03-04T02:10:43Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
 We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
 arXiv  Detail & Related papers  (2021-10-13T20:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.