FlightLLM: Efficient Large Language Model Inference with a Complete
Mapping Flow on FPGAs
- URL: http://arxiv.org/abs/2401.03868v2
- Date: Tue, 9 Jan 2024 06:47:46 GMT
- Title: FlightLLM: Efficient Large Language Model Inference with a Complete
Mapping Flow on FPGAs
- Authors: Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang,
Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao
Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang
- Abstract summary: Transformer-based Large Language Models (LLMs) have made a significant impact on various domains.
This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs.
FlightLLM beats the NVIDIA A100 GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
- Score: 23.381331567339526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based Large Language Models (LLMs) have made a significant impact
on various domains. However, LLMs' efficiency suffers from both heavy
computation and memory overheads. Compression techniques like sparsification
and quantization are commonly used to mitigate the gap between LLMs'
computation/memory overheads and hardware capacity. However, existing GPU and
transformer-based accelerators cannot efficiently process compressed LLMs, due
to the following unresolved challenges: low computational efficiency,
underutilized memory bandwidth, and large compilation overheads.
This paper proposes FlightLLM, enabling efficient LLM inference with a
complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative
insight: the computation and memory overheads of LLMs can be addressed by
exploiting FPGA-specific resources (e.g., DSP48 blocks and the heterogeneous
memory hierarchy). First, we propose a configurable sparse DSP chain to
support different sparsity patterns with high computation efficiency. Second, we propose an
always-on-chip decode scheme to boost memory bandwidth with mixed-precision
support. Finally, to make FlightLLM available for real-world LLMs, we propose a
length-adaptive compilation method to reduce the compilation overhead.
Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0$\times$
higher energy efficiency and 1.8$\times$ better cost efficiency against
commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using
vLLM and SmoothQuant under the batch size of one. FlightLLM beats NVIDIA A100
GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
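The length-adaptive compilation is described only at a high level above. The sketch below is a hypothetical illustration of the general bucketing idea behind such a method, not FlightLLM's actual compiler: rather than compiling one instruction stream per possible sequence length, lengths are rounded up to a small set of buckets so compiled artifacts can be reused. All names and the granularity are mine.

```python
# Hypothetical sketch of length-adaptive compilation (not FlightLLM's code):
# compile one kernel per length bucket instead of one per exact sequence length.

def make_buckets(max_len: int, granularity: int) -> list[int]:
    """Candidate sequence lengths for which kernels are pre-compiled."""
    return list(range(granularity, max_len + 1, granularity))

def pick_bucket(seq_len: int, buckets: list[int]) -> int:
    """Round a runtime sequence length up to the nearest compiled bucket."""
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds max bucket {buckets[-1]}")

compiled_kernels = {}  # bucket length -> compiled instruction stream

def get_kernel(seq_len: int, buckets: list[int], compile_fn):
    """Reuse a cached kernel; compilation cost is paid at most once per bucket."""
    b = pick_bucket(seq_len, buckets)
    if b not in compiled_kernels:
        compiled_kernels[b] = compile_fn(b)  # expensive step, bounded by len(buckets)
    return compiled_kernels[b]

buckets = make_buckets(max_len=2048, granularity=128)  # 16 kernels instead of 2048
kernel = get_kernel(seq_len=300, buckets=buckets, compile_fn=lambda n: f"kernel_{n}")
print(kernel)  # kernel_384: a 300-token request runs the 384-length kernel
```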
Related papers
- FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference [0.8749675983608171]
Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. This work introduces an automation framework that leverages weight pruning and low-bit quantization. We present a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform.
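The N:M sparsity pattern referred to here can be sketched generically (this shows the standard pattern, not the paper's framework): at most N of every M consecutive weights are kept nonzero, e.g., 2:4.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Generic N:M pruning sketch: keep the n largest-magnitude weights
    in every group of m consecutive weights, zeroing the rest."""
    w = weights.reshape(-1, m).copy()
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(8)
print(prune_n_m(w))  # every group of 4 now has at most 2 nonzeros
```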
arXiv Detail & Related papers (2025-12-31T08:27:40Z)
- LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs [14.676146518251185]
We present LUT-LLM, the first FPGA accelerator enabling 1B+ LLM inference via vector-quantized memory operations. LUT-LLM achieves 1.66x lower latency than AMD MI210 and 1.72x higher energy efficiency than NVIDIA A100, scaling to 32B models with 2.16x efficiency gain over A100.
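A minimal sketch of what memory-based computation via vector quantization can look like, with the product-quantization layout and all shapes assumed by me rather than taken from LUT-LLM: dot products are precomputed once per codeword and then fetched by table lookup.

```python
import numpy as np

# Sketch of lookup-table (memory-based) matmul via product quantization.
# Hypothetical parameters; not LUT-LLM's actual scheme.
K, D, C = 16, 4, 64               # codebook size, sub-vector width, chunks
codebook = np.random.randn(K, D)
codes = np.random.randint(0, K, size=C)   # quantized weight row: one code per chunk

x = np.random.randn(C * D)
x_chunks = x.reshape(C, D)
lut = x_chunks @ codebook.T       # (C, K): every chunk-codeword dot product, computed once
y = lut[np.arange(C), codes].sum()  # dot(weight_row, x) using lookups only

# Reference check against the dequantized weights
w = codebook[codes].reshape(-1)
assert np.allclose(y, w @ x)
```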
arXiv Detail & Related papers (2025-11-09T01:17:08Z)
- APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration [5.075697428779204]
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. This is primarily due to limited GPU Tensor Core support for arbitrary-precision formats, inefficient memory management, and inflexible kernel optimizations. We propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM.
arXiv Detail & Related papers (2025-08-26T14:48:29Z)
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [52.079202872069835]
Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size. We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
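Lossless size reduction of this kind rests on statistical redundancy in the float encoding, so a variable-length code can shrink weights without changing any bit of the output. The toy below is my own generic illustration using Huffman code lengths over float32 exponent bytes; it is not DFloat11's actual format.

```python
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(freqs: Counter) -> dict:
    """Return symbol -> Huffman code length for the given frequency table."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # deepen both subtrees
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

weights = np.random.randn(100_000).astype(np.float32)
exponents = (weights.view(np.uint32) >> 23) & 0xFF  # IEEE-754 exponent bytes
freqs = Counter(exponents.tolist())
lengths = huffman_code_lengths(freqs)
bits = sum(freqs[s] * l for s, l in lengths.items())
print(f"{bits / len(weights):.2f} bits/exponent vs 8 uncompressed")
```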
arXiv Detail & Related papers (2025-04-15T22:38:38Z)
- Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into LLM sparsity.
LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning.
LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inference overhead.
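A rough sketch of the stated mechanism, with all shapes and the masking rule assumed by me rather than taken from the paper: if the low-rank update B@A is masked with the same sparsity pattern as the pruned weight, merging it after fine-tuning keeps the model sparse, which is why no extra inference overhead appears.

```python
import numpy as np

# Sketch (assumptions mine): mask the low-rank update with the weight's
# sparsity pattern so the merged matrix stays sparse after fine-tuning.
d, r = 64, 8
W = np.random.randn(d, d)
mask = np.abs(W) > 0.5            # stand-in for a magnitude-pruning mask
W_sparse = W * mask

A = np.random.randn(r, d) * 0.01  # LoRA factors being fine-tuned
B = np.random.randn(d, r) * 0.01

delta = (B @ A) * mask            # sparsify the LoRA outcome to match W's pattern
W_merged = W_sparse + delta       # merged weight keeps the same sparsity
assert np.all(W_merged[~mask] == 0)
```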
arXiv Detail & Related papers (2025-02-20T18:37:32Z)
- MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs [5.88896081401217]
We introduce MEADOW, a framework that significantly reduces the off-chip memory access for large language models.
MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation.
MEADOW achieves an end-to-end latency improvement of over 40%, compared to prior LLM optimization works.
arXiv Detail & Related papers (2025-02-14T23:50:37Z)
- Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation.
Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process.
We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
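To illustrate what a groupwise non-uniform codebook means in practice, here is a generic 1-D k-means sketch; the paper's codebook construction and kernels are more sophisticated, and the group size and bit width below are my assumptions.

```python
import numpy as np

def fit_codebook(w: np.ndarray, bits: int = 3, iters: int = 20) -> np.ndarray:
    """1-D k-means: learn 2**bits non-uniform levels matching w's distribution."""
    centers = np.quantile(w, np.linspace(0, 1, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(len(centers)):
            if np.any(idx == k):
                centers[k] = w[idx == k].mean()
    return centers

group = np.random.randn(128)            # one quantization group of weights
codebook = fit_codebook(group)          # 8 levels, denser where weights cluster
codes = np.abs(group[:, None] - codebook[None, :]).argmin(axis=1)  # 3-bit indices
dequant = codebook[codes]
print("rmse:", np.sqrt(np.mean((group - dequant) ** 2)))
```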
arXiv Detail & Related papers (2024-12-23T03:44:29Z)
- Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision.
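Bit-level decompose-and-recover can be sketched generically (this shows the standard technique, not the paper's bipolar-INT format or tensor-core kernels): an n-bit unsigned matmul becomes n one-bit matmuls recombined with shifts.

```python
import numpy as np

# Sketch of bit-level decomposition: an n-bit unsigned matmul becomes
# n one-bit matmuls recombined by shifts (generic technique, not the paper's kernel).
def bitserial_matmul(A: np.ndarray, B: np.ndarray, n_bits: int) -> np.ndarray:
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(n_bits):
        bit_plane = (A >> i) & 1               # 1-bit slice of A
        acc += (bit_plane @ B) << i            # recover magnitude via shift
    return acc

A = np.random.randint(0, 2 ** 5, size=(4, 8))  # 5-bit operand
B = np.random.randint(0, 2 ** 8, size=(8, 3))
assert np.array_equal(bitserial_matmul(A, B, 5), A @ B)
```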
arXiv Detail & Related papers (2024-09-26T14:17:58Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
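One way matrix multiplication can be eliminated, sketched generically (the paper involves more than this; the snippet shows only the ternary-weight trick): with weights in {-1, 0, +1}, a matrix product reduces to additions and subtractions.

```python
import numpy as np

# Sketch: with ternary weights {-1, 0, +1}, y = W @ x needs no multiplications,
# only selective adds and subtracts (generic illustration of the idea).
def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        y[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()  # add/sub only
    return y

W = np.random.choice([-1, 0, 1], size=(4, 16))
x = np.random.randn(16)
assert np.allclose(ternary_matvec(W, x), W @ x)
```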
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
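The 1.58 in the name is log2(3): every weight takes one of three values. A minimal sketch of absmean ternary quantization in the spirit of BitNet b1.58 follows; it is my simplification, not the paper's training recipe.

```python
import numpy as np

# Sketch of absmean ternary quantization in the spirit of BitNet b1.58:
# scale by the mean absolute weight, then round and clip to {-1, 0, +1}.
def ternarize(W: np.ndarray, eps: float = 1e-8):
    gamma = np.abs(W).mean()                       # per-tensor scale
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_t.astype(np.int8), gamma              # gamma rescales outputs later

W = np.random.randn(4, 4)
W_t, gamma = ternarize(W)
print(W_t)          # entries are -1, 0, or +1
print(np.log2(3))   # 1.58...: three states per weight
```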
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer-based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- FlashDecoding++: Faster Large Language Model Inference on GPUs [16.289377349637995]
We present FlashDecoding++, a fast inference engine supporting mainstream Large Language Model (LLM) inference.
To tackle the above challenges, FlashDecoding++ introduces a unified max value technique for different partial softmax computations.
FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on both NVIDIA and AMD GPUs.
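A simplified sketch of the unified-max idea (my rendering of the abstract's claim, not the engine's kernel): if logits are known to stay below some bound, partial softmax chunks can subtract that fixed bound instead of synchronizing on the true maximum, so chunks proceed independently.

```python
import numpy as np

# Sketch of unified-max partial softmax: chunks use a fixed bound PHI instead
# of the true global max, removing cross-chunk synchronization.
PHI = 10.0  # assumed upper bound on logits (model/statistics dependent)

def partial_softmax_unified_max(logit_chunks):
    exps = [np.exp(c - PHI) for c in logit_chunks]   # no cross-chunk max needed
    denom = sum(e.sum() for e in exps)
    return np.concatenate(exps) / denom

chunks = [np.random.randn(4), np.random.randn(4)]
ref = np.exp(np.concatenate(chunks)); ref /= ref.sum()
assert np.allclose(partial_softmax_unified_max(chunks), ref)
```

Mathematically the result is unchanged for any constant shift; the bound matters only for avoiding overflow when logits are large.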
arXiv Detail & Related papers (2023-11-02T14:57:03Z)
- Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
However, deploying these models has been challenging due to the astronomical number of model parameters.
We propose an effective approach that can make the deployment of LLMs more efficient.
arXiv Detail & Related papers (2023-11-01T13:08:50Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peers and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage.
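The memory saving comes from never materializing the full gradient set: each parameter's gradient is consumed by an update as soon as backpropagation produces it. A PyTorch-style sketch of this fused pattern is below; the use of post-accumulate hooks is my way of expressing the idea, not LOMO's exact code.

```python
import torch

# Sketch of a fused gradient/update step in the spirit of LOMO (not its code):
# update each parameter as soon as its gradient arrives, then free the gradient,
# so the full set of gradients is never resident in memory at once.
lr = 1e-3
model = torch.nn.Linear(512, 512)

def sgd_then_free(p: torch.nn.Parameter) -> None:
    with torch.no_grad():
        p.add_(p.grad, alpha=-lr)  # immediate SGD update
    p.grad = None                  # release gradient memory right away

for p in model.parameters():
    p.register_post_accumulate_grad_hook(sgd_then_free)  # requires torch >= 2.1

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()  # parameters are updated during backward, grads freed as they appear
```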
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.