FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU
- URL: http://arxiv.org/abs/2303.06865v2
- Date: Mon, 12 Jun 2023 07:48:53 GMT
- Title: FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU
- Authors: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin,
Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez,
Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
- Abstract summary: FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
- Score: 89.2451963569343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The high computational and memory requirements of large language model (LLM)
inference make it feasible only with multiple high-end accelerators. Motivated
by the emerging demand for latency-insensitive tasks with batched processing,
this paper initiates the study of high-throughput LLM inference using limited
resources, such as a single commodity GPU. We present FlexGen, a
high-throughput generation engine for running LLMs with limited GPU memory.
FlexGen can be flexibly configured under various hardware resource constraints
by aggregating memory and computation from the GPU, CPU, and disk. By solving a
linear programming problem, it searches for efficient patterns to store and
access tensors. FlexGen further compresses the weights and the attention cache
to 4 bits with negligible accuracy loss. These techniques enable FlexGen to
have a larger space of batch size choices and thus significantly increase
maximum throughput. As a result, when running OPT-175B on a single 16GB GPU,
FlexGen achieves significantly higher throughput compared to state-of-the-art
offloading systems, reaching a generation throughput of 1 token/s for the first
time with an effective batch size of 144. On the HELM benchmark, FlexGen can
benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21
hours. The code is available at https://github.com/FMInference/FlexGen
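To make the offloading idea above concrete, the following is a minimal, hypothetical sketch of a policy search over GPU/CPU/disk placements: it enumerates what fraction of the weights and the KV cache to keep on each device, discards placements that do not fit, and keeps the one with the cheapest estimated data movement per token. The capacities, bandwidths, tensor sizes, and brute-force sweep below are illustrative assumptions; the paper formulates the real search as a linear program over a more detailed cost model.

```python
# Hypothetical sketch of an offloading policy search (not FlexGen's solver).
from itertools import product

GB = 1 << 30
WEIGHT_BYTES = 325 * GB          # assumed: OPT-175B weights in fp16
KV_CACHE_BYTES = 12 * GB         # assumed: KV cache for one large batch
CAPACITY = {"gpu": 16 * GB, "cpu": 200 * GB, "disk": 1500 * GB}
# Assumed bandwidth for moving data to GPU compute, in bytes/s
# (tensors already in GPU memory incur no transfer cost here).
BANDWIDTH = {"gpu": float("inf"), "cpu": 12 * GB, "disk": 2 * GB}

def transfer_time(placement):
    """Estimated seconds spent moving weights + KV cache for one token."""
    total = 0.0
    for tensor_bytes, fractions in placement:
        for dev, frac in fractions.items():
            if BANDWIDTH[dev] != float("inf"):
                total += tensor_bytes * frac / BANDWIDTH[dev]
    return total

def fits(placement):
    """Check that every device stays within its memory capacity."""
    used = {dev: 0.0 for dev in CAPACITY}
    for tensor_bytes, fractions in placement:
        for dev, frac in fractions.items():
            used[dev] += tensor_bytes * frac
    return all(used[d] <= CAPACITY[d] for d in CAPACITY)

best = None
steps = [i / 10 for i in range(11)]  # sweep placements in 10% increments
for wg, wc in product(steps, steps):
    wd = round(1 - wg - wc, 10)
    if wd < 0:
        continue
    for kg, kc in product(steps, steps):
        kd = round(1 - kg - kc, 10)
        if kd < 0:
            continue
        placement = [
            (WEIGHT_BYTES, {"gpu": wg, "cpu": wc, "disk": wd}),
            (KV_CACHE_BYTES, {"gpu": kg, "cpu": kc, "disk": kd}),
        ]
        if not fits(placement):
            continue
        cost = transfer_time(placement)
        if best is None or cost < best[0]:
            best = (cost, placement)

cost, placement = best
print(f"estimated transfer time per token: {cost:.2f} s")
for name, (_, fractions) in zip(["weights", "kv_cache"], placement):
    print(name, {dev: round(frac, 1) for dev, frac in fractions.items()})
```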
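The 4-bit compression of the weights and the attention (KV) cache can be pictured as group-wise quantization. The sketch below assumes simple per-group min-max quantization with a group size of 64, not FlexGen's exact scheme; it shows the packing that yields the roughly 4x memory saving over fp16 and the resulting reconstruction error.

```python
# Minimal sketch of group-wise 4-bit quantization (assumed scheme, for illustration).
import numpy as np

GROUP_SIZE = 64  # assumed group size

def quantize_4bit(x: np.ndarray):
    """Quantize a flat float array to 4-bit codes per group of GROUP_SIZE."""
    x = x.reshape(-1, GROUP_SIZE)
    mn = x.min(axis=1, keepdims=True)
    mx = x.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0      # 4 bits -> 16 levels (0..15)
    scale[scale == 0] = 1.0       # avoid division by zero for constant groups
    q = np.clip(np.round((x - mn) / scale), 0, 15).astype(np.uint8)
    # Pack two 4-bit codes into one byte to realize the memory saving.
    packed = (q[:, 0::2] << 4) | q[:, 1::2]
    return packed, scale, mn

def dequantize_4bit(packed, scale, mn):
    """Recover an approximate float array from the packed representation."""
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    q[:, 0::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return (q.astype(np.float32) * scale + mn).reshape(-1)

# Usage: compress a fake weight tensor and check the reconstruction error.
w = np.random.randn(4096 * GROUP_SIZE).astype(np.float32)
packed, scale, mn = quantize_4bit(w)
w_hat = dequantize_4bit(packed, scale, mn)
print("max abs error:", np.abs(w - w_hat).max())
```

Shrinking both the weights and the KV cache this way frees GPU memory for more sequences per batch, which is how the compression feeds into the larger effective batch sizes reported above.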
Related papers
- Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models [40.41898661688188]
This paper introduces Superpipeline, a framework designed to optimize the execution of large AI models on constrained hardware.
Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds.
arXiv Detail & Related papers (2024-10-11T13:17:05Z) - FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV).
This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs).
It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear (MARLIN) kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference [47.043257902725294]
We propose a novel sparse format that compresses the unstructured sparse patterns of pruned LLM weights to their non-zero values, with a high compression ratio and low decompression overhead; a generic bitmap-style packing in this spirit is sketched after this list.
Compared to offloaded inference using the popular Hugging Face Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x.
arXiv Detail & Related papers (2024-06-17T15:55:08Z) - FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.5114389643299]
This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
arXiv Detail & Related papers (2024-06-11T00:17:39Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning [16.86356520836045]
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training.
Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management.
Our experiments show more than a 12x improvement in runtime compared to the Hugging Face/DeepSpeed implementation with four GPUs, while consuming less than half the VRAM per GPU.
arXiv Detail & Related papers (2024-03-17T23:02:04Z) - FlexLLM: A System for Co-Serving Large Language Model Inference and
Parameter-Efficient Finetuning [9.979010592887096]
Existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests.
We present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration.
Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36%.
arXiv Detail & Related papers (2024-02-29T01:33:08Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
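As a companion to the Endor entry above, here is a generic sketch of a bitmap-style sparse format for pruned weights: only the non-zero values are stored, plus a one-bit-per-element occupancy mask, so a mostly-zero matrix shrinks roughly in proportion to its sparsity. The layout and numbers are illustrative assumptions, not Endor's actual format.

```python
# Generic bitmap-style sparse packing sketch (assumed layout, for illustration).
import numpy as np

def pack_sparse(w: np.ndarray):
    """Return (bitmask bytes, non-zero values, shape) for a pruned weight tensor."""
    flat = w.ravel()
    mask = flat != 0
    bitmask = np.packbits(mask)           # 1 bit of metadata per element
    values = flat[mask].astype(np.float16)
    return bitmask, values, w.shape

def unpack_sparse(bitmask, values, shape):
    """Reconstruct the dense tensor from the packed representation."""
    n = int(np.prod(shape))
    mask = np.unpackbits(bitmask, count=n).astype(bool)
    flat = np.zeros(n, dtype=np.float16)
    flat[mask] = values
    return flat.reshape(shape)

# Usage: pack a 90%-sparse matrix and compare sizes.
rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float16)
w[rng.random(w.shape) < 0.9] = 0.0
bitmask, values, shape = pack_sparse(w)
dense_bytes = w.nbytes
packed_bytes = bitmask.nbytes + values.nbytes
print(f"dense: {dense_bytes} B, packed: {packed_bytes} B "
      f"({dense_bytes / packed_bytes:.1f}x smaller)")
assert np.array_equal(unpack_sparse(bitmask, values, shape), w)
```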