Related papers: FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

URL: http://arxiv.org/abs/2401.03868v2
Date: Tue, 9 Jan 2024 06:47:46 GMT
Title: FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs
Authors: Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang
Abstract summary: Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. This paper proposes FlightLLM, enabling efficient LLMs inference with a complete mapping flow on FPGAs. FlightLLM beats NVIDIA A100 GPU with 1.2$times$ higher throughput using the latest Versal VHK158 FPGA.
Score: 23.381331567339526
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLM's computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLMs inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution that the computation and memory overhead of LLMs can be solved by utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory hierarchy). We propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$ higher energy efficiency and 1.8$\times$ better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant under the batch size of one. FlightLLM beats NVIDIA A100 GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.

Related papers

Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into sparse LLM sparsity. LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning. LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden.
arXiv Detail & Related papers (2025-02-20T18:37:32Z)
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs [5.88896081401217]
We introduce MEADOW, a framework that significantly reduces the off-chip memory access for large language models. MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation. MEADOW achieves an end-to-end latency improvement of over 40%, compared to prior LLM optimization works.
arXiv Detail & Related papers (2025-02-14T23:50:37Z)
Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation. Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
arXiv Detail & Related papers (2024-12-23T03:44:29Z)
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference. We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization. We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
arXiv Detail & Related papers (2024-09-26T14:17:58Z)
Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration. Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models. Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs. We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Models (LLMs) variant, namely BitNet b1.58. The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs. It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer based Large Language Models (LLMs) have been widely used in many fields. We propose an efficient LLM inference solution with low latency and high throughput. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
FlashDecoding++: Faster Large Language Model Inference on GPUs [16.289377349637995]
We present FlashDecoding++, a fast inference engine supporting mainstream Large Language Model (LLM) inference. To tackle the above challenges, FlashDecoding++ introduces a unified max value technique for different partial softmax computations. FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on both NVIDIA and AMD GPUs.
arXiv Detail & Related papers (2023-11-02T14:57:03Z)
Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. deploying these models has been challenging due to the astronomical amount of model parameters. We propose an effective approach that can make the deployment of LLMs more efficiently.
arXiv Detail & Related papers (2023-11-01T13:08:50Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.