FlightLLM: Efficient Large Language Model Inference with a Complete
Mapping Flow on FPGAs
- URL: http://arxiv.org/abs/2401.03868v2
- Date: Tue, 9 Jan 2024 06:47:46 GMT
- Title: FlightLLM: Efficient Large Language Model Inference with a Complete
Mapping Flow on FPGAs
- Authors: Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang,
Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao
Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang
- Abstract summary: Transformer-based Large Language Models (LLMs) have made a significant impact on various domains.
This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs.
FlightLLM beats the NVIDIA A100 GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
- Score: 23.381331567339526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based Large Language Models (LLMs) have made a significant impact
on various domains. However, LLMs' efficiency suffers from both heavy
computation and memory overheads. Compression techniques like sparsification
and quantization are commonly used to mitigate the gap between LLMs'
computation/memory overheads and hardware capacity. However, existing GPU and
transformer-based accelerators cannot efficiently process compressed LLMs, due
to the following unresolved challenges: low computational efficiency,
underutilized memory bandwidth, and large compilation overheads.
This paper proposes FlightLLM, enabling efficient LLM inference with a
complete mapping flow on FPGAs. In FlightLLM, we highlight the insight that
the computation and memory overheads of LLMs can be addressed by exploiting
FPGA-specific resources (e.g., DSP48 blocks and the heterogeneous memory
hierarchy). First, we propose a configurable sparse DSP chain to support
different sparsity patterns with high computation efficiency. Second, we
propose an always-on-chip decode scheme to boost memory bandwidth with
mixed-precision support. Finally, to make FlightLLM practical for real-world
LLMs, we propose a length-adaptive compilation method to reduce the
compilation overhead.
Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$
higher energy efficiency and 1.8$\times$ better cost efficiency than
commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using
vLLM and SmoothQuant at a batch size of one. FlightLLM beats the NVIDIA A100
GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
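The abstract above targets compressed LLMs (structured sparsity plus quantization, e.g. SmoothQuant) mapped onto FPGA resources such as the sparse DSP chain and a mixed-precision decode path. As a rough software-level sketch of the kind of arithmetic such a design handles, and not the authors' hardware, data formats, or tooling, the Python below prunes a weight matrix to a hypothetical 2:4 pattern, quantizes the remaining weights to INT4 with per-row scales, and runs the resulting matrix-vector product; the function names, the 2:4 pattern, and the INT4 format are illustrative assumptions.

```python
import numpy as np

def prune_2_4(w):
    """Keep the two largest-magnitude weights in every group of four
    (a hypothetical 2:4 structured-sparsity pattern)."""
    w = w.copy()
    for r in range(w.shape[0]):
        for g in range(0, w.shape[1], 4):
            grp = w[r, g:g + 4]
            grp[np.argsort(np.abs(grp))[:2]] = 0.0   # zero the two smallest
    return w

def quantize_int4(w):
    """Symmetric per-row INT4 quantization: w ~= scale * q with q in [-7, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def sparse_int4_matvec(q, scale, x):
    """Dequantize on the fly and multiply; the zeroed weights contribute nothing,
    which is the work a sparsity-aware compute unit can skip."""
    return (q.astype(np.float32) * scale) @ x

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)
q, scale = quantize_int4(prune_2_4(w))
print("dense     :", w @ x)
print("2:4 + int4:", sparse_int4_matvec(q, scale, x))
```

On the FPGA side, the zeroed weights are the work a sparsity-aware DSP chain can skip, and the INT4-to-float conversion is the kind of operation an on-chip decode path performs; numpy here only makes the arithmetic explicit.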
Related papers
- Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
arXiv Detail & Related papers (2024-09-26T14:17:58Z)
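The entry above describes decomposing a matrix multiplication at the bit level and recovering the full-precision result. A minimal numpy sketch of that general idea (using plain unsigned bit planes rather than the paper's bipolar-INT format, whose details the summary does not give) might look like this:

```python
import numpy as np

def bit_plane_matmul(a_q, b, nbits):
    """Multiply an unsigned nbits-bit integer matrix a_q by b one bit plane at a
    time, then recover the exact product as sum_i 2^i * (plane_i @ b)."""
    acc = np.zeros((a_q.shape[0], b.shape[1]), dtype=np.int64)
    for i in range(nbits):
        plane = (a_q >> i) & 1                      # 1-bit slice of every weight
        acc += (plane.astype(np.int64) @ b) << i    # scale by the bit's weight
    return acc

rng = np.random.default_rng(0)
a_q = rng.integers(0, 2 ** 3, size=(4, 8))          # 3-bit "weights"
b = rng.integers(-5, 6, size=(8, 2))                # integer "activations"
assert np.array_equal(bit_plane_matmul(a_q, b, nbits=3), a_q @ b)
```

Because each 1-bit plane is multiplied independently and the results are recombined with powers of two, changing the weight precision only changes how many plane products are accumulated, which is what makes the precision arbitrary.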
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
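One way to see how matrix multiplications can be eliminated, as claimed in the MatMul-free entry above, is to restrict weights to {-1, 0, +1}: a matrix-vector product then reduces to additions and subtractions. The numpy toy below shows only that observation and is not the paper's actual architecture or training method:

```python
import numpy as np

def ternary_matvec(w, x):
    """With weights restricted to {-1, 0, +1}, y = W @ x needs no multiplications:
    each output element is a sum of some inputs minus a sum of others."""
    plus = np.where(w == 1, x, 0.0).sum(axis=1)      # rows broadcast over x
    minus = np.where(w == -1, x, 0.0).sum(axis=1)
    return plus - minus

rng = np.random.default_rng(0)
w = rng.integers(-1, 2, size=(4, 8))                 # ternary weight matrix
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(w, x), w @ x)
```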
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup of up to 8.6x on CPUs for sparse-quantized LLaMA models.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
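A hedged illustration of what "1.58 bits" means in the BitNet b1.58 entry above: three weight levels {-1, 0, +1} carry log2(3) ≈ 1.58 bits each. The sketch below quantizes floating-point weights to those levels with a single mean-absolute-value scale; the paper's exact quantizer and training recipe may differ from this toy version.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Map weights to {-1, 0, +1} (about 1.58 bits per weight) using a single
    per-tensor scale, here the mean absolute weight."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

rng = np.random.default_rng(0)
w = 0.1 * rng.standard_normal((4, 8))
q, scale = ternary_quantize(w)
print("levels used:", np.unique(q))                  # -1.0, 0.0, 1.0
print("mean |w - scale*q|:", np.abs(w - scale * q).mean())
```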
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer-based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- FlashDecoding++: Faster Large Language Model Inference on GPUs [16.289377349637995]
We present FlashDecoding++, a fast inference engine supporting mainstream Large Language Model (LLM) inference.
Among its optimizations, FlashDecoding++ introduces a unified max value technique for different partial softmax computations.
FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on both NVIDIA and AMD GPUs.
arXiv Detail & Related papers (2023-11-02T14:57:03Z)
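The FlashDecoding++ entry above mentions a unified max value for partial softmax computations. A chunked softmax normally rescales its partial sums whenever a new running maximum appears; if every chunk subtracts the same fixed offset instead, the partial results can simply be added. The numpy sketch below shows only that numerical idea, not the paper's GPU kernels, and the chunk size and offset value are arbitrary illustrative choices:

```python
import numpy as np

def softmax_unified_max(scores, unified_max, chunk=4):
    """Chunked softmax that subtracts one fixed 'unified' max from every chunk,
    so partial exponentials and partial sums can be produced independently and
    simply added, with no per-chunk rescaling when a new maximum shows up."""
    num = np.empty_like(scores, dtype=np.float64)
    denom = 0.0
    for start in range(0, len(scores), chunk):
        e = np.exp(scores[start:start + chunk] - unified_max)  # same offset everywhere
        num[start:start + chunk] = e
        denom += e.sum()
    return num / denom

rng = np.random.default_rng(0)
s = rng.standard_normal(16)
ref = np.exp(s - s.max()); ref /= ref.sum()          # standard safe softmax
assert np.allclose(softmax_unified_max(s, unified_max=4.0), ref)
```

Mathematically the offset cancels out, so any fixed value yields the same softmax; the practical requirement is that it keeps the exponentials in a numerically safe range for the scores actually encountered.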
- Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
However, deploying these models has been challenging due to the astronomical number of model parameters.
We propose an effective approach that makes the deployment of LLMs more efficient.
arXiv Detail & Related papers (2023-11-01T13:08:50Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and heterogeneity across peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
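The LOMO entry above fuses gradient computation with the parameter update. A toy sketch of the memory argument, under the assumption of plain SGD and with all names and the least-squares example invented for illustration: if each parameter is updated as soon as its own gradient is produced, the gradients of all parameters never need to be held in memory at the same time.

```python
import numpy as np

def fused_sgd_step(params, grad_fns, lr=0.01):
    """LOMO-style fused update (illustrative): apply plain SGD to each parameter
    as soon as its gradient is available, then drop that gradient, so gradients
    for all parameters never coexist in memory."""
    for name, p in params.items():
        g = grad_fns[name]()   # produce just this parameter's gradient
        p -= lr * g            # in-place parameter update
        del g                  # gradient storage can be reclaimed immediately

# Toy problem: loss = 0.5 * ||W @ x + b - y||^2
rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), rng.standard_normal(4)
params = {"W": rng.standard_normal((4, 8)), "b": rng.standard_normal(4)}
r = params["W"] @ x + params["b"] - y       # backpropagated error, computed once
grad_fns = {"W": lambda: np.outer(r, x), "b": lambda: r.copy()}
fused_sgd_step(params, grad_fns)
print("loss after one fused step:",
      0.5 * np.sum((params["W"] @ x + params["b"] - y) ** 2))
```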