Related papers: The Feasibility of Implementing Large-Scale Transformers on Multi-FPGA Platforms

The Feasibility of Implementing Large-Scale Transformers on Multi-FPGA Platforms

URL: http://arxiv.org/abs/2404.16158v1
Date: Wed, 24 Apr 2024 19:25:58 GMT
Title: The Feasibility of Implementing Large-Scale Transformers on Multi-FPGA Platforms
Authors: Yu Gao, Juan Camilo Vega, Paul Chow,
Abstract summary: There is merit to exploring the use of multiple FPGAs for large machine learning applications. There is no commonly-accepted flow for developing and deploying multi-FPGA applications. We develop a scalable multi-FPGA platform and some tools to map large applications to the platform.
Score: 1.0636475069923585
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: FPGAs are rarely mentioned when discussing the implementation of large machine learning applications, such as Large Language Models (LLMs), in the data center. There has been much evidence showing that single FPGAs can be competitive with GPUs in performance for some computations, especially for low latency, and often much more efficient when power is considered. This suggests that there is merit to exploring the use of multiple FPGAs for large machine learning applications. The challenge with using multiple FPGAs is that there is no commonly-accepted flow for developing and deploying multi-FPGA applications, i.e., there are no tools to describe a large application, map it to multiple FPGAs and then deploy the application on a multi-FPGA platform. In this paper, we explore the feasibility of implementing large transformers using multiple FPGAs by developing a scalable multi-FPGA platform and some tools to map large applications to the platform. We validate our approach by designing an efficient multi-FPGA version of the I-BERT transformer and implement one encoder using six FPGAs as a working proof-of-concept to show that our platform and tools work. Based on our proof-of-concept prototype and the estimations of performance using the latest FPGAs compared to GPUs, we conclude that there can be a place for FPGAs in the world of large machine learning applications. We demonstrate a promising first step that shows that with the right infrastructure and tools it is reasonable to continue to explore the possible benefits of using FPGAs for applications such as LLMs.

Related papers

Many-body computing on Field Programmable Gate Arrays [5.3808713424582395]
We leverage the capabilities of Field Programmable Gate Arrays (FPGAs) for conducting quantum many-body calculations. This has resulted in a tenfold speedup compared to CPU-based computation for a Monte Carlo algorithm. For the first time, the utilization of FPGA to accelerate a typical tensor network algorithm for many-body ground state calculations.
arXiv Detail & Related papers (2024-02-09T14:01:02Z)
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
ACCL+: an FPGA-Based Collective Engine for Distributed Applications [8.511142540352665]
ACCL+ is an open-source versatile FPGA-based collective communication library. It is portable across different platforms and supports UDP, TCP, as well as RDMA. It can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.
arXiv Detail & Related papers (2023-12-18T22:56:01Z)
SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices [48.47320494918925]
This work tackles the challenges of deploying stateof-the-art object detection models onto FPGA devices for ultralow latency applications. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources.
arXiv Detail & Related papers (2023-09-04T13:15:01Z)
Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on lowpower embedded FPGAs designed for edge computing applications. The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow. We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC) We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
arXiv Detail & Related papers (2023-04-13T18:00:01Z)
HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices [71.45672882756001]
This study introduces a novel streaming architecture based toolflow for mapping 3D Convolutional Neural Networks onto FPGAs. The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics. The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs.
arXiv Detail & Related papers (2023-03-30T08:25:27Z)
LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Graph Neural Network (LL-GNN) designs for particle detectors. The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z)
Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark [11.575901540758574]
We present our development experience for the Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN perJ, which aim to democratize AI- hardware codesign of optimized neural networks on FPGAs. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms.
arXiv Detail & Related papers (2022-06-23T15:57:17Z)
An FPGA-based Solution for Convolution Operation Acceleration [0.0]
This paper proposes an FPGA-based architecture to accelerate the convolution operation. The project's purpose is to produce an FPGA IP core that can process a convolutional layer at a time.
arXiv Detail & Related papers (2022-06-09T14:12:30Z)
FederatedScope-GNN: Towards a Unified, Comprehensive and Efficient Package for Federated Graph Learning [65.48760613529033]
Federated graph learning (FGL) has not been well supported due to its unique characteristics and requirements. We first discuss the challenges in creating an easy-to-use FGL package and accordingly present our implemented package FederatedScope-GNN (FS-G) We validate the effectiveness of FS-G by conducting extensive experiments, which simultaneously gains many valuable insights about FGL for the community.
arXiv Detail & Related papers (2022-04-12T06:48:06Z)
VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs) Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations. This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
arXiv Detail & Related papers (2022-01-17T20:27:52Z)
Multiplierless MP-Kernel Machine For Energy-efficient Edge Devices [6.335302509003343]
We present a novel framework for designing multiplierless kernel machines. The framework uses a piecewise linear (PWL) approximation based on a margin propagation (MP) technique. We propose a hardware-friendly MP-based inference and online training algorithm that has been optimized for a Field Programmable Gate Array (FPGA) platform.
arXiv Detail & Related papers (2021-06-03T16:06:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.