Large Graph Convolutional Network Training with GPU-Oriented Data
Communication Architecture
- URL: http://arxiv.org/abs/2103.03330v1
- Date: Thu, 4 Mar 2021 21:00:17 GMT
- Title: Large Graph Convolutional Network Training with GPU-Oriented Data
Communication Architecture
- Authors: Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun
Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu
- Abstract summary: Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems.
Current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features.
This approach, however, puts tremendous pressure on host memory bandwidth and the CPU.
We propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory.
- Score: 19.2129567657739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale
graph-based recommender systems. Training a GCN requires the minibatch generator
to traverse the graph and sample the sparsely located neighboring nodes to obtain
their features. Since real-world graphs often exceed the capacity of GPU
memory, current GCN training systems keep the feature table in host memory and
rely on the CPU to collect sparse features before sending them to the GPUs.
This approach, however, puts tremendous pressure on host memory bandwidth and
the CPU. This is because the CPU needs to (1) read sparse features from memory,
(2) write features into memory as a dense format, and (3) transfer the features
from memory to the GPUs. In this work, we propose a novel GPU-oriented data
communication approach for GCN training, where GPU threads directly access
sparse features in host memory through zero-copy accesses without much CPU
help. By removing the CPU gathering stage, our method significantly reduces the
consumption of the host resources and data access latency. We further present
two important techniques to achieve high host memory access efficiency by the
GPU: (1) automatic data access address alignment to maximize PCIe packet
efficiency, and (2) asynchronous zero-copy access and kernel execution to fully
overlap data transfer with training. We incorporate our method into PyTorch and
evaluate its effectiveness using several graphs with sizes up to 111 million
nodes and 1.6 billion edges. In a multi-GPU training setup, our method is
65-92% faster than the conventional data transfer method, and can even match
the performance of all-in-GPU-memory training for some graphs that fit in GPU
memory.
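As a rough illustration of the mechanism described above, the sketch below shows the general zero-copy pattern in plain CUDA: the feature table is allocated as pinned, mapped host memory, a warp-per-node kernel gathers feature rows directly over PCIe with lane-coalesced (and therefore well-aligned) reads, and a dedicated stream leaves room for overlapping the gather with training kernels. All names and sizes (gather_features, FEAT_DIM, the toy table) are assumptions for illustration; the paper's actual system is integrated into PyTorch and includes automatic address alignment that this sketch does not reproduce.
```cuda
// Minimal sketch of the zero-copy gathering idea (assumed names and sizes,
// not the authors' PyTorch integration): the node feature table stays in
// pinned host memory mapped into the GPU address space, and a warp-per-node
// kernel reads feature rows directly over PCIe.
#include <cstdio>
#include <cuda_runtime.h>

#define FEAT_DIM 256  // feature width in floats (illustrative)
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

// One warp copies one sampled node's feature row from host memory (zero-copy)
// into a dense minibatch buffer in device memory. Consecutive lanes touch
// consecutive floats, so reads over PCIe are wide and well aligned.
__global__ void gather_features(const float *__restrict__ feats_zero_copy,
                                const long  *__restrict__ node_ids,
                                float       *__restrict__ minibatch_out,
                                int num_sampled) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp >= num_sampled) return;
    const float *src = feats_zero_copy + node_ids[warp] * FEAT_DIM;
    float       *dst = minibatch_out + (long)warp * FEAT_DIM;
    for (int i = lane; i < FEAT_DIM; i += 32)
        dst[i] = src[i];  // direct host-memory read, no CPU gather stage
}

int main() {
    const long num_nodes = 1L << 18;  // toy feature table size (illustrative)
    const int  batch     = 1024;      // sampled nodes per minibatch

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Feature table in pinned, mapped host memory: GPU kernels can
    // dereference it directly instead of waiting for a CPU gather + copy.
    float *h_feats = nullptr, *d_feats_view = nullptr;
    CHECK(cudaHostAlloc((void **)&h_feats,
                        num_nodes * FEAT_DIM * sizeof(float), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_feats_view, h_feats, 0));

    long  *d_ids = nullptr;
    float *d_out = nullptr;
    CHECK(cudaMalloc((void **)&d_ids, batch * sizeof(long)));
    CHECK(cudaMalloc((void **)&d_out, (long)batch * FEAT_DIM * sizeof(float)));
    CHECK(cudaMemset(d_ids, 0, batch * sizeof(long)));  // placeholder sampled ids

    // Separate streams let the zero-copy gather of the next minibatch overlap
    // with the training kernels of the current one (the async-overlap idea).
    cudaStream_t gather_stream, compute_stream;
    CHECK(cudaStreamCreate(&gather_stream));
    CHECK(cudaStreamCreate(&compute_stream));

    int threads = 256;
    int blocks  = (batch * 32 + threads - 1) / threads;
    gather_features<<<blocks, threads, 0, gather_stream>>>(d_feats_view, d_ids, d_out, batch);
    // ... training kernels would be launched on compute_stream here ...
    CHECK(cudaDeviceSynchronize());

    cudaStreamDestroy(gather_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(d_ids); cudaFree(d_out); cudaFreeHost(h_feats);
    return 0;
}
```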
Related papers
- Accelerating Sampling and Aggregation Operations in GNN Frameworks with
GPU Initiated Direct Storage Accesses [9.773813896475264]
Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data.
Training GNNs on large-scale graphs remains a significant challenge due to the lack of efficient data access and data movement methods.
We propose the GPU Initiated Direct Storage Access (GIDS) dataloader to enable GPU-oriented GNN training for large-scale graphs.
arXiv Detail & Related papers (2023-06-28T17:22:15Z)
- Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching [59.8522166385372]
Training and inference with graph neural networks (GNNs) on massive graphs has been actively studied since the inception of GNNs.
This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings.
We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data.
arXiv Detail & Related papers (2023-05-04T21:04:01Z)
- Heterogeneous Acceleration Pipeline for Recommendation System Training [1.8457649813040096]
Recommendation models rely on deep learning networks and large embedding tables.
These models are typically trained using hybrid CPU-GPU or GPU-only configurations.
This paper introduces Hotline, a heterogeneous CPU acceleration pipeline.
arXiv Detail & Related papers (2022-04-11T23:10:41Z)
- Scaling R-GCN Training with Graph Summarization [71.06855946732296]
Training of Relational Graph Convolutional Networks (R-GCN) does not scale well with the size of the graph.
In this work, we experiment with the use of graph summarization techniques to compress the graph.
We obtain reasonable results on the AIFB, MUTAG and AM datasets.
arXiv Detail & Related papers (2022-03-05T00:28:43Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- MG-GCN: Scalable Multi-GPU GCN Training Framework [1.7188280334580197]
Full batch training of Graph Convolutional Network (GCN) models is not feasible on a single GPU for large graphs.
MG-GCN employs multiple High-Performance Computing optimizations, including efficient re-use of memory buffers.
MG-GCN achieves super-linear speedup with respect to DGL, on the Reddit graph on both DGX-1 (V100) and DGX-A100.
arXiv Detail & Related papers (2021-10-17T00:41:43Z)
- Efficient Scaling of Dynamic Graph Neural Networks [7.313571385612325]
This is the first scaling study on dynamic Graph Neural Networks.
We devise mechanisms for reducing the GPU memory usage.
We design a graph difference-based strategy to significantly reduce the transfer time.
arXiv Detail & Related papers (2021-09-16T11:51:20Z)
- Global Neighbor Sampling for Mixed CPU-GPU Training on Giant Graphs [26.074384252289384]
Graph neural networks (GNNs) are powerful tools for learning from graph data and are widely used in various applications.
Although a number of sampling-based methods have been proposed to enable mini-batch training on large graphs, they have not been proven to work on truly industry-scale graphs.
We propose Global Neighbor Sampling, which aims at training GNNs on giant graphs specifically for mixed CPU-GPU training.
arXiv Detail & Related papers (2021-06-11T03:30:25Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)