ACCL+: an FPGA-Based Collective Engine for Distributed Applications
- URL: http://arxiv.org/abs/2312.11742v1
- Date: Mon, 18 Dec 2023 22:56:01 GMT
- Title: ACCL+: an FPGA-Based Collective Engine for Distributed Applications
- Authors: Zhenhao He, Dario Korolija, Yu Zhu, Benjamin Ramhorst, Tristan Laan,
Lucian Petrica, Michaela Blott, Gustavo Alonso
- Abstract summary: ACCL+ is an open-source versatile FPGA-based collective communication library.
It is portable across different platforms and supports UDP, TCP, as well as RDMA.
It can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks.
We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.
- Score: 8.511142540352665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs
or network-attached accelerators. Despite their potential, developing
distributed FPGA-accelerated applications remains cumbersome due to the lack of
appropriate infrastructure and communication abstractions. To facilitate the
development of distributed applications with FPGAs, in this paper we propose
ACCL+, an open-source versatile FPGA-based collective communication library.
Portable across different platforms and supporting UDP, TCP, as well as RDMA,
ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective
communication. Additionally, it can serve as a collective offload engine for
CPU applications, freeing the CPU from networking tasks. It is user-extensible,
allowing new collectives to be implemented and deployed without having to
re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100
Gb/s networking, comparing its performance against software MPI over RDMA. The
results demonstrate ACCL+'s significant advantages for FPGA-based distributed
applications and highly competitive performance for CPU applications. We
showcase ACCL+'s dual role with two use cases: seamlessly integrating as a
collective offload engine to distribute CPU-based vector-matrix multiplication,
and serving as a crucial and efficient component in designing fully FPGA-based
distributed deep-learning recommendation inference.
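To make the notion of a "collective" concrete: engines like ACCL+ offload operations such as all-reduce, where every node ends up with the elementwise sum of all nodes' vectors. The sketch below is an illustrative software simulation of the standard ring all-reduce algorithm in Python — it is not ACCL+'s actual API, only a picture of the communication pattern such an engine executes in hardware.

```python
# Illustrative sketch (not ACCL+'s API): a software simulation of the
# ring all-reduce collective that hardware engines offload.

def ring_allreduce(node_vectors):
    """Simulate ring all-reduce: every node ends with the elementwise sum."""
    n = len(node_vectors)              # number of nodes in the ring
    length = len(node_vectors[0])      # vector length, assumed divisible by n
    chunk = length // n
    bufs = [list(v) for v in node_vectors]

    def get(node, c):
        return bufs[node][c * chunk:(c + 1) * chunk]

    def acc(node, c, data):            # accumulate into a chunk
        for k in range(chunk):
            bufs[node][c * chunk + k] += data[k]

    def put(node, c, data):            # overwrite a chunk
        bufs[node][c * chunk:(c + 1) * chunk] = data

    # Reduce-scatter: after n-1 steps, node i holds the fully reduced
    # chunk (i+1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, get(i, (i - s) % n)) for i in range(n)]
        for i, c, data in sends:
            acc((i + 1) % n, c, data)

    # All-gather: circulate the reduced chunks until every node has all of them.
    for s in range(n - 1):
        sends = [(i, (i - s + 1) % n, get(i, (i - s + 1) % n)) for i in range(n)]
        for i, c, data in sends:
            put((i + 1) % n, c, data)
    return bufs
```

Each node sends and receives only one chunk per step, which is why this pattern maps well onto a streaming hardware pipeline with fixed point-to-point links between neighbors.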
Related papers
- Hacking the Fabric: Targeting Partial Reconfiguration for Fault Injection in FPGA Fabrics [2.511032692122208]
We present a novel fault attack methodology capable of causing persistent fault injections in partial bitstreams during the process of FPGA reconfiguration.
This attack leverages power-wasters and is timed to inject faults into bitstreams as they are being loaded onto the FPGA through the reconfiguration manager.
arXiv Detail & Related papers (2024-10-21T20:40:02Z)
- Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator [0.5714074111744111]
We present an end-to-end workflow for deploying CNNs on Field Programmable Gate Arrays (FPGAs) using the Gemmini accelerator.
We were able to achieve real-time performance by deploying a YOLOv7 model on a Xilinx ZCU102 FPGA with an energy efficiency of 36.5 GOP/s/W.
arXiv Detail & Related papers (2024-08-14T09:24:00Z)
- The Feasibility of Implementing Large-Scale Transformers on Multi-FPGA Platforms [1.0636475069923585]
There is merit to exploring the use of multiple FPGAs for large machine learning applications.
There is no commonly-accepted flow for developing and deploying multi-FPGA applications.
We develop a scalable multi-FPGA platform and some tools to map large applications to the platform.
arXiv Detail & Related papers (2024-04-24T19:25:58Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the heterogeneity of peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
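The two-step decomposition described above can be pictured with a toy example: a small computational primitive (here, a tiled matrix-multiply core) and separate outer loops that tile the full problem in terms of it. This Python sketch only illustrates the idea — the names and structure are illustrative, not the paper's actual TPP API.

```python
# Toy illustration of the primitive-plus-logical-loops decomposition
# (illustrative names, not the paper's actual API).

def gemm_primitive(A, B, C, i0, j0, k0, T):
    """Computational core: accumulate a TxT tile product into C."""
    for i in range(T):
        for j in range(T):
            acc = 0
            for k in range(T):
                acc += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
            C[i0 + i][j0 + j] += acc

def matmul(A, B, T=2):
    """Outer 'logical loops': iterate over tiles and invoke the primitive."""
    n, m, p = len(A), len(B), len(B[0])   # dims assumed divisible by T
    C = [[0] * p for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, p, T):
            for k0 in range(0, m, T):
                gemm_primitive(A, B, C, i0, j0, k0, T)
    return C
```

The point of the separation is that the primitive can be replaced by a hand-optimized or hardware-specific kernel while the outer loops stay portable and declarative.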
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs).
This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow.
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC).
We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
arXiv Detail & Related papers (2023-04-13T18:00:01Z)
- RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems [68.8204255655161]
We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP.
RAMP supports large-scale distributed and parallel computing systems (12.8 Tb/s per node for up to 65,536 nodes).
arXiv Detail & Related papers (2022-11-28T11:24:51Z)
- FFCNN: Fast FPGA based Acceleration for Convolution neural network inference [0.0]
We present Fast Inference on FPGAs for Convolution Neural Network (FFCNN).
FFCNN is based on a deeply pipelined OpenCL kernels architecture.
Data reuse and task mapping techniques are also presented to improve design efficiency.
arXiv Detail & Related papers (2022-08-28T16:55:25Z)
- An FPGA-based Solution for Convolution Operation Acceleration [0.0]
This paper proposes an FPGA-based architecture to accelerate the convolution operation.
The project's purpose is to produce an FPGA IP core that can process a convolutional layer at a time.
arXiv Detail & Related papers (2022-06-09T14:12:30Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.