DFX: A Low-latency Multi-FPGA Appliance for Accelerating
Transformer-based Text Generation
- URL: http://arxiv.org/abs/2209.10797v1
- Date: Thu, 22 Sep 2022 05:59:59 GMT
- Title: DFX: A Low-latency Multi-FPGA Appliance for Accelerating
Transformer-based Text Generation
- Authors: Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim,
Dongsoo Lee, Joo-Young Kim
- Abstract summary: We present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput.
We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources.
- Score: 7.3619135783046
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer is a deep learning language model widely used for natural
language processing (NLP) services in datacenters. Among transformer models,
Generative Pre-trained Transformer (GPT) has achieved remarkable performance in
text generation, or natural language generation (NLG), which needs the
processing of a large input context in the summarization stage, followed by the
generation stage that produces a single word at a time. Conventional
platforms such as GPUs are specialized for the parallel processing of large
inputs in the summarization stage, but their performance degrades
significantly in the generation stage because of its sequential nature. An
efficient hardware platform is therefore required to address the high latency
caused by the sequential characteristic of text generation.
In this paper, we present DFX, a multi-FPGA acceleration appliance that
executes GPT-2 model inference end-to-end with low latency and high throughput
in both summarization and generation stages. DFX uses model parallelism and
optimized dataflow that is model-and-hardware-aware for fast simultaneous
workload execution among devices. Its compute cores operate on custom
instructions and provide GPT-2 operations end-to-end. We implement the proposed
hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the
channels of the high bandwidth memory (HBM) and the maximum number of compute
resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x
energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is
also 8.21x more cost-effective than the GPU appliance, suggesting that it is a
promising solution for text generation workloads in cloud datacenters.
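The latency problem the abstract describes comes from the structure of autoregressive decoding itself. Below is a minimal, generic sketch of the two stages of GPT-style text generation; `model` is a hypothetical decoder that returns next-token logits for every position of its input, and the sketch is only an illustration of the workload, not of DFX's hardware or instruction set.

```python
import numpy as np

def generate(model, prompt_tokens, num_new_tokens):
    # Summarization (prefill) stage: the whole prompt is processed in one
    # large, highly parallel pass -- the regime GPUs are optimized for.
    logits = model(prompt_tokens)              # hypothetical: [len(prompt), vocab]
    tokens = list(prompt_tokens)
    tokens.append(int(np.argmax(logits[-1])))  # first generated token

    # Generation stage: each new token depends on the previous one, so the
    # model must be invoked once per token. The loop is strictly sequential
    # and each step does little work, which is why latency dominates here.
    for _ in range(num_new_tokens - 1):
        logits = model(tokens)                 # one small call per token
        tokens.append(int(np.argmax(logits[-1])))
    return tokens
```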
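The abstract also mentions model parallelism across the four FPGAs. The following sketch shows the general idea of intra-layer model parallelism, splitting one weight matrix column-wise across four devices so each computes a slice of the output concurrently; it is a simplified assumption-laden illustration, not DFX's actual dataflow or partitioning scheme.

```python
import numpy as np

NUM_DEVICES = 4  # e.g., four FPGA cards

def partition_columns(weight, num_devices=NUM_DEVICES):
    """Split a [d_in, d_out] weight matrix into per-device column shards."""
    return np.array_split(weight, num_devices, axis=1)

def parallel_matmul(x, shards):
    """Each 'device' multiplies the same activations by its own shard;
    concatenating the partial outputs reconstructs the full result."""
    partial_outputs = [x @ w for w in shards]  # concurrent on real hardware
    return np.concatenate(partial_outputs, axis=-1)

# Usage: a 1024x4096 projection split across 4 devices gives the same result
# as the unpartitioned multiplication.
x = np.random.randn(1, 1024)
w = np.random.randn(1024, 4096)
assert np.allclose(parallel_matmul(x, partition_columns(w)), x @ w)
```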
Related papers
- FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV).
This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs).
It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z) - ProTEA: Programmable Transformer Encoder Acceleration on FPGA [0.0]
Transformer neural networks (TNNs) have been widely utilized in a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV).
Despite the popularity of TNNs, there have been only a limited number of hardware accelerators targeting their critical compute blocks.
This paper introduces ProTEA, a programmable runtime accelerator tailored to the dense computations of state-of-the-art transformer encoders.
arXiv Detail & Related papers (2024-09-21T01:44:13Z) - HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis [0.1979158763744267]
We develop an accelerator for transformers, namely Llama 2, using high-level synthesis (HLS) on field-programmable gate arrays (FPGAs).
We name our method HLSTransform; the FPGA designs we synthesize with HLS achieve up to 12.75x and 8.25x reductions in energy used per token.
Given the lack of existing open-source FPGA accelerators for transformers, we open-source our code and document our steps for synthesis.
arXiv Detail & Related papers (2024-04-29T21:26:06Z) - Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - X-Former: In-Memory Acceleration of Transformers [7.194491150684456]
Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism.
Traditional deep neural network (DNN) accelerators face limitations in processing Transformers efficiently.
In-memory accelerators based on non-volatile memory promise to be an effective solution to this challenge.
We present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements.
arXiv Detail & Related papers (2023-03-13T21:11:54Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense
Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves the global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z) - Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translations with an 8x-15x speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.