GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms
- URL: http://arxiv.org/abs/2001.02498v1
- Date: Tue, 31 Dec 2019 21:19:01 GMT
- Title: GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms
- Authors: Hanqing Zeng, Viktor Prasanna
- Abstract summary: Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs.
It is challenging to accelerate training of GCNs due to substantial and irregular data communication.
We design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems.
- Score: 1.2183405753834562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a lightweight pre-processing step based on a graph-theoretic approach. This pre-processing, performed on the CPU, significantly reduces the memory access requirements and the computation to be performed on the FPGA. To accelerate the weight update in GCN layers, we propose a systolic-array-based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load balance and resource utilization by accurate performance modeling. We evaluate our design on a Xilinx Alveo U200 board hosted by a 40-core Xeon server. On three large graphs, we achieve an order-of-magnitude training speedup with negligible accuracy loss, compared with a state-of-the-art implementation on a multi-core platform.
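The abstract combines two mechanisms: per-iteration training on sampled subgraphs, and a CPU-side pre-processing pass that pairs vertices appearing together in many neighbor lists so their shared partial feature sums are computed only once. The Python sketch below illustrates both under stated assumptions; the frontier sampler, the greedy pair-matching policy, and all function names are ours for exposition, not the paper's implementation.

```python
import numpy as np
from collections import Counter

def sample_subgraph(adj, num_roots, rng):
    """Toy frontier-style sampler: random roots plus their 1-hop neighbors."""
    roots = rng.choice(len(adj), size=num_roots, replace=False)
    nodes = set(int(r) for r in roots)
    for v in list(nodes):
        nodes.update(adj[v])
    nodes = sorted(nodes)
    index = {v: i for i, v in enumerate(nodes)}   # original id -> local id
    sub_adj = [[index[u] for u in adj[v] if u in index] for v in nodes]
    return nodes, sub_adj

def match_common_pairs(sub_adj):
    """Greedily match vertex pairs that co-occur in several neighbor lists,
    so each matched pair's feature sum is computed once and reused."""
    pair_count = Counter()
    for nbrs in sub_adj:
        s = sorted(set(nbrs))
        for i in range(len(s)):
            for j in range(i + 1, len(s)):
                pair_count[(s[i], s[j])] += 1
    matched, used = [], set()
    for (u, v), c in pair_count.most_common():
        if c >= 2 and u not in used and v not in used:
            matched.append((u, v))
            used.update((u, v))
    return matched

def aggregate_with_reuse(feats, sub_adj, matched):
    """Neighbor-sum aggregation that substitutes precomputed pair sums."""
    pair_sum = {p: feats[p[0]] + feats[p[1]] for p in matched}
    pair_of = {u: p for p in matched for u in p}
    out = np.zeros_like(feats)
    for v, nbrs in enumerate(sub_adj):
        remaining = set(nbrs)
        for p in {pair_of[u] for u in nbrs if u in pair_of}:
            if p[0] in remaining and p[1] in remaining:
                out[v] += pair_sum[p]      # one add replaces two
                remaining -= {p[0], p[1]}
        for u in remaining:
            out[v] += feats[u]
    return out

rng = np.random.default_rng(0)
adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
nodes, sub = sample_subgraph(adj, 2, rng)
feats = rng.standard_normal((len(nodes), 4))
agg = aggregate_with_reuse(feats, sub, match_common_pairs(sub))
```

Every reused matched pair saves one feature-vector read and one addition per destination vertex, which is where a reduction in FPGA memory traffic would come from.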
Related papers
- Efficient Message Passing Architecture for GCN Training on HBM-based FPGAs with Orthogonal Topology On-Chip Networks [0.0]
Graph Convolutional Networks (GCNs) are state-of-the-art deep learning models for representation learning on graphs.
We propose a message-passing architecture that leverages NUMA-based memory access properties.
We also re-engineer the backpropagation algorithm specifically for GCNs within our proposed accelerator.
arXiv Detail & Related papers (2024-11-06T12:00:51Z)
- MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs [11.026326555186333]
This paper develops a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework.
It demonstrates an approximately 15-40% improvement in end-to-end training performance on the National Energy Research Scientific Computing Center's (NERSC) Perlmutter supercomputer.
arXiv Detail & Related papers (2024-10-30T05:10:38Z)
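A minimal sketch of what a parameterized prefetch-and-eviction buffer for remote ("halo") node features can look like; the scoring policy, the decay parameter, and the `fetch_remote` stand-in are assumptions, not the MassiveGNN or DistDGL APIs.

```python
class FeaturePrefetchBuffer:
    def __init__(self, capacity, fetch_remote, decay=0.9):
        self.capacity = capacity          # max cached node features
        self.fetch_remote = fetch_remote  # callable: node_id -> feature
        self.decay = decay                # staleness decay per step
        self.cache = {}                   # node_id -> (feature, score)

    def prefetch(self, likely_nodes):
        """Warm the cache with nodes expected in upcoming minibatches."""
        for n in likely_nodes:
            if n not in self.cache:
                self._insert(n, self.fetch_remote(n))

    def get(self, node_id):
        if node_id in self.cache:
            feat, score = self.cache[node_id]
            self.cache[node_id] = (feat, score + 1.0)   # reward reuse
            return feat
        feat = self.fetch_remote(node_id)               # cache miss
        self._insert(node_id, feat)
        return feat

    def step(self):
        """Decay scores once per training step so stale entries age out."""
        self.cache = {n: (f, s * self.decay)
                      for n, (f, s) in self.cache.items()}

    def _insert(self, node_id, feat):
        if len(self.cache) >= self.capacity:            # evict lowest score
            victim = min(self.cache, key=lambda n: self.cache[n][1])
            del self.cache[victim]
        self.cache[node_id] = (feat, 1.0)
```

In a real distributed job, `fetch_remote` would be the RPC that pulls halo-node features from the owning rank, and `prefetch` would be driven by the sampler's upcoming minibatches.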
- T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages the transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z)
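A minimal sketch of the alignment step such a transferable encoder enables: once one pretrained encoder produces embeddings for both graphs, nodes can be matched by cosine similarity with no retraining. The greedy one-to-one matching below is an illustrative assumption, not the T-GAE matching procedure.

```python
import numpy as np

def align_by_embeddings(emb_a, emb_b):
    """Map each node of graph A to a distinct node of graph B."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                                   # cosine similarities
    mapping, taken = {}, set()
    # Visit the highest-similarity pairs first (greedy one-to-one matching).
    for idx in np.argsort(sim, axis=None)[::-1]:
        i, j = np.unravel_index(idx, sim.shape)
        if int(i) not in mapping and int(j) not in taken:
            mapping[int(i)] = int(j)
            taken.add(int(j))
    return mapping
```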
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines, respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
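The task INR-Arch targets can be stated compactly in code: given a function built as a computation graph of adds and multiplies, obtain its nth-order derivative by differentiating the graph repeatedly. The nested forward-mode sketch below is a software illustration of that transformation, not the INR-Arch compiler or dataflow design.

```python
class Dual:
    """Forward-mode dual number: tracks a value and its derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.val * o.dot + self.dot * o.val)  # product rule
    __rmul__ = __mul__

def derivative(f, order):
    """Lift f through `order` nested forward-mode passes."""
    if order == 0:
        return f
    def df(x):
        return f(Dual(x, 1.0)).dot
    return derivative(df, order - 1)

# f(x) = x**3 as a graph of multiplies; its 3rd derivative is 6 everywhere.
f = lambda x: x * x * x
print(derivative(f, 3)(2.0))   # -> 6.0
```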
- Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up GNN training by up to 10 times over existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
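A minimal sketch of the vertex-cut idea: edges, not vertices, are split across workers, and boundary vertices are duplicated into every partition that owns one of their edges, so each worker aggregates locally without inter-worker traffic. The round-robin edge assignment is a toy policy, and the reweighting by replication factor is only hinted at; neither is the CoFree-GNN algorithm.

```python
from collections import defaultdict

def vertex_cut_partition(edges, num_parts):
    """Assign each edge to a partition; duplicate shared endpoints."""
    part_edges = defaultdict(list)
    part_nodes = defaultdict(set)
    for k, (u, v) in enumerate(edges):
        p = k % num_parts                 # toy edge-assignment policy
        part_edges[p].append((u, v))
        part_nodes[p].update((u, v))
    # A vertex's replication factor tells us how to reweight its updates.
    replicas = defaultdict(int)
    for p in range(num_parts):
        for n in part_nodes[p]:
            replicas[n] += 1
    return part_edges, part_nodes, replicas

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
pe, pn, rep = vertex_cut_partition(edges, 2)
print(pn[0], pn[1], dict(rep))   # duplicated vertices appear in both parts
```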
- Scalable Graph Convolutional Network Training on Distributed-Memory Systems [5.169989177779801]
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges.
We propose a highly parallel training algorithm that scales to large processor counts.
arXiv Detail & Related papers (2022-12-09T17:51:13Z)
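The kernel this line of work parallelizes is the sparse-dense product A @ H of neighbor aggregation. The sketch below block-partitions A's rows across workers; in a genuine distributed-memory run, the reads of H[u] for non-local u are exactly the irregular communication the paper's algorithm must organize. This illustrates the kernel, not the paper's algorithm.

```python
import numpy as np

def rowblock_spmm(adj_rows, H, num_workers):
    """Compute A @ H where A is given as neighbor lists (one per row)."""
    n = len(adj_rows)
    bounds = np.linspace(0, n, num_workers + 1, dtype=int)
    out = np.zeros_like(H)
    for w in range(num_workers):        # each loop body = one worker's block
        for v in range(bounds[w], bounds[w + 1]):
            for u in adj_rows[v]:       # irregular reads of rows of H
                out[v] += H[u]          # non-local u => communication
    return out

H = np.eye(4, dtype=float)
adj = [[1], [0, 2], [1, 3], [2]]
print(rowblock_spmm(adj, H, 2))
```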
- A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking [124.21408098724551]
Large-scale graph training is a notoriously challenging problem for graph neural networks (GNNs).
We present a new ensembling training manner, named EnGCN, to address the existing issues.
Our proposed method has achieved new state-of-the-art (SOTA) performance on large-scale datasets.
arXiv Detail & Related papers (2022-10-14T03:43:05Z)
- Comprehensive Graph Gradual Pruning for Sparse Training in Graph Neural Networks [52.566735716983956]
We propose a graph gradual pruning framework termed CGP to dynamically prune GNNs.
Unlike LTH-based methods, the proposed CGP approach requires no re-training, which significantly reduces the computation costs.
Our proposed strategy greatly improves both training and inference efficiency while matching or even exceeding the accuracy of existing methods.
arXiv Detail & Related papers (2022-07-18T14:23:31Z)
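A minimal sketch of gradual pruning during training, using a generic cubic sparsity schedule and magnitude masking; CGP's actual pruning criterion and its graph-level (edge) pruning are not reproduced here.

```python
import numpy as np

def target_sparsity(step, total_steps, final_sparsity):
    """Cubic schedule: prune slowly at first, faster toward the end."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

def prune_by_magnitude(weights, sparsity):
    """Boolean mask zeroing the smallest-magnitude fraction of weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > threshold

W = np.random.randn(64, 64)
for step in range(0, 1001, 100):            # interleave with training steps
    mask = prune_by_magnitude(W, target_sparsity(step, 1000, 0.9))
    W = W * mask                             # masked weights stay zero
```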
- End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning [13.810753108848582]
We propose a unified, end-to-end, programmable graph representation learning framework.
It is capable of mining the complexity of high-level programs down to a universal intermediate representation, extracting specific computational patterns, and predicting which code segments would run best on a specific core.
In the evaluation, we demonstrate a maximum speedup of 6.42x compared to the thread-based execution, and 2.02x compared to the state-of-the-art technique.
arXiv Detail & Related papers (2022-04-25T22:13:13Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- SPA-GCN: Efficient and Flexible GCN Accelerator with an Application for Graph Similarity Computation [7.54579279348595]
We propose a flexible architecture called SPA-GCN for accelerating Graph Convolutional Networks (GCN) on graphs.
We show that SPA-GCN can deliver a high speedup compared to a multi-core CPU implementation and a GPU implementation.
arXiv Detail & Related papers (2021-11-10T20:47:57Z)