FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU
- URL: http://arxiv.org/abs/2303.06865v2
- Date: Mon, 12 Jun 2023 07:48:53 GMT
- Title: FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU
- Authors: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin,
Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez,
Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
- Abstract summary: FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
- Score: 89.2451963569343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The high computational and memory requirements of large language model (LLM)
inference make it feasible only with multiple high-end accelerators. Motivated
by the emerging demand for latency-insensitive tasks with batched processing,
this paper initiates the study of high-throughput LLM inference using limited
resources, such as a single commodity GPU. We present FlexGen, a
high-throughput generation engine for running LLMs with limited GPU memory.
FlexGen can be flexibly configured under various hardware resource constraints
by aggregating memory and computation from the GPU, CPU, and disk. By solving a
linear programming problem, it searches for efficient patterns to store and
access tensors. FlexGen further compresses the weights and the attention cache
to 4 bits with negligible accuracy loss. These techniques enable FlexGen to
have a larger space of batch size choices and thus significantly increase
maximum throughput. As a result, when running OPT-175B on a single 16GB GPU,
FlexGen achieves significantly higher throughput compared to state-of-the-art
offloading systems, reaching a generation throughput of 1 token/s for the first
time with an effective batch size of 144. On the HELM benchmark, FlexGen can
benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21
hours. The code is available at https://github.com/FMInference/FlexGen
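As an illustration of the 4-bit compression idea mentioned in the abstract, here is a minimal sketch in plain Python. The abstract does not spell out the scheme, so this assumes simple min-max quantization over one small group of values; the function names (`quantize_group`, `dequantize_group`, `pack_nibbles`, `unpack_nibbles`) are hypothetical and are not taken from the FlexGen code base.

```python
def quantize_group(values, bits=4):
    """Min-max quantize a group of floats to unsigned `bits`-bit codes.

    Assumption: one (offset, scale) pair per group; not FlexGen's actual code.
    """
    levels = (1 << bits) - 1                      # 15 levels above zero for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def pack_nibbles(codes):
    """Store two 4-bit codes per byte, padding with a zero code if needed."""
    padded = codes + [0] if len(codes) % 2 else codes
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte >> 4)
        out.append(byte & 0x0F)
    return out[:count]


# Hypothetical group of weights, used only for illustration.
weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0]
codes, lo, scale = quantize_group(weights)
packed = pack_nibbles(codes)                      # 6 codes fit in 3 bytes
restored = dequantize_group(unpack_nibbles(packed, len(codes)), lo, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in half a byte plus a shared offset and scale per group, and the reconstruction error of any value is bounded by half the quantization step (`scale / 2`).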
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference [47.043257902725294]
We propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with a high compression ratio and low decompression overhead.
Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x.
arXiv Detail & Related papers (2024-06-17T15:55:08Z)
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.743943561871825]
This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
arXiv Detail & Related papers (2024-06-11T00:17:39Z)
- MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter [40.616849959987555]
We introduce a novel mechanism that fine-tunes Large Language Models (LLMs) with adapters that are larger yet memory-efficient.
This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs.
We employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU.
arXiv Detail & Related papers (2024-06-07T14:49:22Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning [16.86356520836045]
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training.
Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management.
Our experiments show more than 12x improvement in runtime compared to the Hugging Face/DeepSpeed implementation with four GPUs while consuming less than half the VRAM per GPU.
arXiv Detail & Related papers (2024-03-17T23:02:04Z)
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning [9.979010592887096]
Existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests.
We present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration.
Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36%.
arXiv Detail & Related papers (2024-02-29T01:33:08Z)
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [64.34635279436054]
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing.
We present a solution to the memory problem created by their massive parameter counts, in the form of a new compression and execution framework called QMoE.
arXiv Detail & Related papers (2023-10-25T17:24:53Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [9.322987670900778]
ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters.
It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible.
arXiv Detail & Related papers (2021-04-16T02:22:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.