Related papers: HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis

URL: http://arxiv.org/abs/2405.00738v1
Date: Mon, 29 Apr 2024 21:26:06 GMT
Title: HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis
Authors: Andy He, Darren Key, Mason Bulling, Andrew Chang, Skyler Shapiro, Everett Lee,
Abstract summary: We develop an accelerator for transformers, namely, Llama 2, using high level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x reduction and 8.25x reduction in energy used per token compared to a CPU and GPU respectively. With the lack of existing open-source FPGA accelerators for transformers, we open-source our code and document our steps for synthesis.
Score: 0.1979158763744267
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning applications and are used widely in training and inference of transformers; transformers have achieved state-of-the-art performance in many areas of machine learning and are especially used in most modern Large Language Models (LLMs). However, GPUs require large amounts of energy, which poses environmental concerns, demands high operational costs, and causes GPUs to be unsuitable for edge computing. We develop an accelerator for transformers, namely, Llama 2, an open-source state-of-the-art LLM, using high level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x reduction and 8.25x reduction in energy used per token on the Xilinx Virtex UltraScale+ VU9P FPGA compared to an Intel Xeon Broadwell E5-2686 v4 CPU and NVIDIA RTX 3090 GPU respectively, while increasing inference speeds by up to 2.46x compared to CPU and maintaining 0.53x the speed of an RTX 3090 GPU despite the GPU's 4 times higher base clock rate. With the lack of existing open-source FPGA accelerators for transformers, we open-source our code and document our steps for synthesis. We hope this work will serve as a step in democratizing the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods as a whole. The code can be found on https://github.com/HLSTransform/submission.

Related papers

A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs [0.0]
ADAPTOR is a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. It incorporates efficient matrix tiling to distribute resources across FPGA platforms. It achieves a speedup of 1.7x to 2.25x compared to some state-of-the-art FPGA-based accelerators.
arXiv Detail & Related papers (2024-11-27T08:53:19Z)
FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z)
Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
arXiv Detail & Related papers (2024-03-29T15:07:21Z)
GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering [112.16239342037714]
GES (Generalized Exponential Splatting) is a novel representation that employs Generalized Exponential Function (GEF) to model 3D scenes. With the aid of a frequency-modulated loss, GES achieves competitive performance in novel-view synthesis benchmarks.
arXiv Detail & Related papers (2024-02-15T17:32:50Z)
Many-body computing on Field Programmable Gate Arrays [5.3808713424582395]
We leverage the capabilities of Field Programmable Gate Arrays (FPGAs) for conducting quantum many-body calculations. This has resulted in a tenfold speedup compared to CPU-based computation for a Monte Carlo algorithm. For the first time, FPGAs are used to accelerate a typical tensor network algorithm for many-body ground state calculations.
arXiv Detail & Related papers (2024-02-09T14:01:02Z)
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU. When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation [7.3619135783046]
We present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model end-to-end with low latency and high throughput. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources.
arXiv Detail & Related papers (2022-09-22T05:59:59Z)
Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks. The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources. This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration. We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. Our design has very small accuracy loss and has 80.2x and 2.6x speedup compared to CPU and GPU implementation.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
FTRANS: Energy-Efficient Acceleration of Transformers using FPGA [11.032972017827248]
We propose an efficient acceleration framework, Ftrans, for transformer-based large scale language representations. Our framework significantly reduces the model size of NLP models by up to 16 times. Our FPGA design achieves 27.07x and 81x improvement in performance and energy efficiency compared to CPU, and up to 8.80x improvement in energy efficiency compared to GPU.
arXiv Detail & Related papers (2020-07-16T18:58:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.