FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems
- URL: http://arxiv.org/abs/2204.10943v1
- Date: Fri, 22 Apr 2022 21:57:00 GMT
- Title: FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems
- Authors: Rui Ma, Evangelos Georganas, Alexander Heinecke, Andrew Boutros, Eriko Nurvitadhi
- Abstract summary: We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
- Score: 62.20308752994373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rapid advances in artificial intelligence (AI) technology have led to
significant accuracy improvements in a myriad of application domains at the
cost of larger and more compute-intensive models. Training such models on
massive amounts of data typically requires scaling to many compute nodes and
relies heavily on collective communication algorithms, such as all-reduce, to
exchange the weight gradients between different nodes. The overhead of these
collective communication operations in a distributed AI training system can
bottleneck its performance, with more pronounced effects as the number of nodes
increases. In this paper, we first characterize the all-reduce operation
overhead by profiling distributed AI training. Then, we propose a new smart
network interface card (NIC) for distributed AI training systems using
field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and
optimize network bandwidth utilization via data compression. The AI smart NIC
frees up the system's compute resources to perform the more compute-intensive
tensor operations and increases the overall node-to-node communication
efficiency. We perform real measurements on a prototype distributed AI training
system comprising 6 compute nodes to evaluate the performance gains of our
proposed FPGA-based AI smart NIC compared to a baseline system with regular
NICs. We also use these measurements to validate an analytical model that we
formulate to predict performance when scaling to larger systems. Our proposed
FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6
nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to
the baseline system using conventional NICs.
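To make the role of the all-reduce step concrete, below is a minimal Python sketch (not the authors' implementation): it simulates the ring all-reduce pattern that such a smart NIC would offload, and pairs it with a simple alpha-beta cost estimate that folds in a wire-compression factor, the kind of first-order analytical model one might use to reason about scaling. The ring_allreduce and allreduce_seconds functions, and every default parameter value (per-step latency, link bandwidth, compression ratio), are illustrative assumptions rather than figures from the paper.

    # Minimal sketch, not from the paper: simulate ring all-reduce and
    # estimate its time with an alpha-beta model. All defaults below are
    # illustrative assumptions, not measured values.
    import numpy as np

    def ring_allreduce(grads):
        """Sum per-node gradient vectors with a simulated ring all-reduce.

        grads: list of equal-length 1-D arrays, one per node.
        Returns one fully reduced gradient per node.
        """
        n = len(grads)
        chunks = [list(np.array_split(np.asarray(g, dtype=np.float64).copy(), n))
                  for g in grads]

        # Reduce-scatter: after n-1 steps node i holds the complete sum of
        # chunk (i + 1) % n.
        for step in range(n - 1):
            sends = [(node, (node - step) % n, chunks[node][(node - step) % n].copy())
                     for node in range(n)]
            for node, cid, payload in sends:
                chunks[(node + 1) % n][cid] += payload

        # All-gather: circulate each completed chunk around the ring.
        for step in range(n - 1):
            sends = [(node, (node + 1 - step) % n, chunks[node][(node + 1 - step) % n].copy())
                     for node in range(n)]
            for node, cid, payload in sends:
                chunks[(node + 1) % n][cid] = payload

        return [np.concatenate(c) for c in chunks]

    def allreduce_seconds(nodes, msg_bytes, alpha_s=5e-6, link_gbps=100.0,
                          compression=1.0):
        """Alpha-beta estimate of ring all-reduce time.

        compression is the fraction of bytes left on the wire after
        compression (e.g. 0.5 for 2x); the defaults are assumptions
        chosen only for illustration.
        """
        steps = 2 * (nodes - 1)                       # reduce-scatter + all-gather
        wire_bytes = compression * msg_bytes / nodes  # per step, per link
        return steps * (alpha_s + wire_bytes * 8 / (link_gbps * 1e9))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        grads = [rng.standard_normal(1 << 12) for _ in range(6)]
        reduced = ring_allreduce(grads)
        assert all(np.allclose(r, sum(grads)) for r in reduced)
        for n in (6, 32):
            print(n, "nodes, 100 MB gradients:",
                  f"{allreduce_seconds(n, 100e6):.4f} s uncompressed,",
                  f"{allreduce_seconds(n, 100e6, compression=0.5):.4f} s at 2x compression")

Even with these placeholder numbers, the estimate shows why communication overhead grows with scale: the latency term scales with 2(n-1) steps while the per-step payload shrinks only as 1/n, which is the trend the paper's profiling, NIC offload, and analytical model address.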
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Exploiting On-chip Heterogeneity of Versal Architecture for GNN
Inference Acceleration [0.5249805590164902]
Graph Neural Networks (GNNs) have revolutionized many Machine Learning (ML) applications, such as social network analysis, bioinformatics, etc.
We leverage the heterogeneous computing capabilities of AMD Versal ACAP architecture to accelerate GNN inference.
For Graph Convolutional Network (GCN) inference, our approach leads to a speedup of 3.9-96.7x compared to designs using PL only on the same ACAP device.
arXiv Detail & Related papers (2023-08-04T23:57:55Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - Learning Connectivity-Maximizing Network Configurations [123.01665966032014]
We propose a supervised learning approach with a convolutional neural network (CNN) that learns to place communication agents from an expert.
We demonstrate the performance of our CNN on canonical line and ring topologies, 105k randomly generated test cases, and larger teams not seen during training.
After training, our system produces connected configurations 2 orders of magnitude faster than the optimization-based scheme for teams of 10-20 agents.
arXiv Detail & Related papers (2021-12-14T18:59:01Z) - Fully-parallel Convolutional Neural Network Hardware [0.7829352305480285]
We propose a new power- and area-efficient architecture for implementing Artificial Neural Networks (ANNs) in hardware.
For the first time, a fully-parallel CNN such as LeNet-5 is embedded and tested on a single FPGA.
arXiv Detail & Related papers (2020-06-22T17:19:09Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization at large scale where the predictive model is a deep neural network.
Our method requires far fewer communication rounds while retaining its theoretical guarantees.
Experiments on several datasets demonstrate the effectiveness of our method and corroborate the theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms [1.2183405753834562]
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs.
It is challenging to accelerate training of GCNs due to substantial and irregular data communication.
We design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems.
arXiv Detail & Related papers (2019-12-31T21:19:01Z)