Communication-Free Distributed GNN Training with Vertex Cut
- URL: http://arxiv.org/abs/2308.03209v1
- Date: Sun, 6 Aug 2023 21:04:58 GMT
- Title: Communication-Free Distributed GNN Training with Vertex Cut
- Authors: Kaidi Cao, Rui Deng, Shirley Wu, Edward W Huang, Karthik Subbian, Jure Leskovec
- Abstract summary: CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
- Score: 63.22674903170953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training Graph Neural Networks (GNNs) on real-world graphs consisting of
billions of nodes and edges is quite challenging, primarily due to the
substantial memory needed to store the graph and its intermediate node and edge
features; there is therefore a pressing need to speed up the training process. A
common approach to achieving this speedup is to divide the graph into many smaller
subgraphs, which are then distributed across multiple GPUs in one or more
machines and processed in parallel. However, existing distributed methods
require frequent and substantial cross-GPU communication, leading to
significant time overhead and progressively diminishing scalability. Here, we
introduce CoFree-GNN, a novel distributed GNN training framework that
significantly speeds up the training process by implementing communication-free
training. The framework utilizes a Vertex Cut partitioning, i.e., rather than
partitioning the graph by cutting the edges between partitions, the Vertex Cut
partitions the edges and duplicates the node information to preserve the graph
structure. Furthermore, the framework maintains high model accuracy by
incorporating a reweighting mechanism to handle a distorted graph distribution
that arises from the duplicated nodes. We also propose a modified DropEdge
technique to further speed up the training process. Using an extensive set of
experiments on real-world networks, we demonstrate that CoFree-GNN speeds up
the GNN training process by up to 10 times over the existing state-of-the-art
GNN training approaches.
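The mechanisms described above (vertex-cut partitioning with node duplication, reweighting to correct for duplicated nodes, and a DropEdge-style step) can be illustrated with a minimal sketch. The greedy edge-assignment heuristic, the 1/(replica count) weights, and the per-partition edge dropping below are plausible stand-ins under stated assumptions, not the paper's exact algorithms:

```python
import numpy as np
from collections import defaultdict

def vertex_cut_partition(edges, num_parts):
    """Assign each edge to exactly one partition; a node incident to edges in
    several partitions is duplicated (mirrored) in each of them."""
    part_edges = defaultdict(list)   # partition id -> edges owned by it
    node_parts = defaultdict(set)    # node id -> partitions holding a copy
    loads = [0] * num_parts
    for u, v in edges:
        # Classic greedy rule: prefer a partition already holding both
        # endpoints, then one endpoint, then the least-loaded partition.
        shared = node_parts[u] & node_parts[v]
        cands = shared or (node_parts[u] | node_parts[v]) or set(range(num_parts))
        p = min(cands, key=lambda q: loads[q])
        part_edges[p].append((u, v))
        node_parts[u].add(p)
        node_parts[v].add(p)
        loads[p] += 1
    return dict(part_edges), dict(node_parts)

def duplication_weights(node_parts):
    """Down-weight each node by its replica count so duplicated nodes do not
    dominate the loss (one plausible reweighting; the paper's may differ)."""
    return {v: 1.0 / len(ps) for v, ps in node_parts.items()}

def local_drop_edge(part_edge_list, drop_prob, rng):
    """DropEdge-style step applied independently inside each partition."""
    keep = rng.random(len(part_edge_list)) >= drop_prob
    return [e for e, k in zip(part_edge_list, keep) if k]

# Toy usage on a 4-node graph split across 2 GPUs.
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
part_edges, node_parts = vertex_cut_partition(edges, num_parts=2)
weights = duplication_weights(node_parts)
dropped = {p: local_drop_edge(es, 0.2, rng) for p, es in part_edges.items()}
print(part_edges, weights)
```

Because every partition holds local copies of all nodes its edges touch, each GPU can run message passing without fetching remote features, which is what makes the training loop communication-free.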
Related papers
- MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs [11.026326555186333]
This paper develops a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework.
It demonstrates about 15-40% improvement in end-to-end training performance on the National Energy Research Scientific Computing Center's (NERSC) Perlmutter supercomputer.
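As a rough illustration of the prefetch-and-eviction idea in the paper above (not its parameterized scheme and not the DistDGL API), one can keep a bounded cache of remote "halo" node features and evict cold entries:

```python
import numpy as np

class HaloFeatureCache:
    """Bounded cache for features of remote ('halo') nodes.
    A simplified sketch: frequently requested features stay resident,
    least-frequently-used entries are evicted when the cache is full."""

    def __init__(self, fetch_fn, capacity=1024):
        self.fetch_fn = fetch_fn          # callable: node id -> feature vector
        self.capacity = capacity
        self.cache = {}                   # node id -> cached feature
        self.hits = {}                    # node id -> access count

    def prefetch(self, node_ids):
        # Pull features expected to be needed by upcoming minibatches.
        for nid in node_ids:
            if nid not in self.cache:
                self._insert(nid, self.fetch_fn(nid))

    def get(self, nid):
        self.hits[nid] = self.hits.get(nid, 0) + 1
        if nid not in self.cache:
            self._insert(nid, self.fetch_fn(nid))  # cache miss: remote fetch
        return self.cache[nid]

    def _insert(self, nid, feat):
        if len(self.cache) >= self.capacity:
            coldest = min(self.cache, key=lambda k: self.hits.get(k, 0))
            del self.cache[coldest]       # evict least-frequently-used entry
        self.cache[nid] = feat

# Toy usage: 'remote' features are just random vectors here.
rng = np.random.default_rng(0)
cache = HaloFeatureCache(lambda nid: rng.standard_normal(8), capacity=4)
cache.prefetch([1, 2, 3])
x = cache.get(2)
```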
arXiv Detail & Related papers (2024-10-30T05:10:38Z)
- Distributed Training of Large Graph Neural Networks with Variable Communication Rates [71.7293735221656]
Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements.
Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs.
We introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model.
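A minimal sketch of compressing node features before cross-machine exchange, assuming simple uniform quantization to a configurable bit width (the paper's scheme varies the rate during training; that adaptation rule is not shown here):

```python
import numpy as np

def quantize(x, num_bits):
    """Uniformly quantize a float array to integer codes with `num_bits` bits
    (num_bits <= 16), returning the codes plus the scale/offset to decompress."""
    lo, hi = float(x.min()), float(x.max())
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint16)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

# Sending low-bit codes instead of float32 features shrinks the communication
# volume at the cost of quantization error the training must tolerate. In a
# real system the 4-bit codes would be bit-packed before transmission; storing
# them in uint16 here keeps the sketch simple.
h = np.random.randn(1024, 128).astype(np.float32)
codes, lo, scale = quantize(h, num_bits=4)
h_hat = dequantize(codes, lo, scale)
print("max abs error:", np.abs(h - h_hat).max())
```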
arXiv Detail & Related papers (2024-06-25T14:57:38Z)
- CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks [7.321893519281194]
Existing distributed systems load the entire graph into memory for graph partitioning.
We propose CATGNN, a cost-efficient and scalable distributed GNN training system.
We also propose a novel streaming partitioning algorithm named SPRING for distributed GNN training.
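CATGNN's selling point here is that the graph never has to fit in memory; SPRING itself is not described in this summary. The sketch below reuses the greedy vertex-cut rule from the earlier sketch but consumes edges from an iterator and writes each one straight to its partition's output, keeping only per-node replica sets and partition loads in memory (a generic streaming partitioner, not SPRING):

```python
import io
from collections import defaultdict

def stream_partition(edge_iter, writers):
    """Consume edges one at a time and route each to the partition it is
    assigned to, so the full edge list never resides in memory."""
    num_parts = len(writers)
    node_parts = defaultdict(set)          # node id -> partitions holding a copy
    loads = [0] * num_parts
    for u, v in edge_iter:
        shared = node_parts[u] & node_parts[v]
        cands = shared or (node_parts[u] | node_parts[v]) or set(range(num_parts))
        p = min(cands, key=lambda q: loads[q])
        writers[p].write(f"{u} {v}\n")     # stream the edge to its partition
        node_parts[u].add(p)
        node_parts[v].add(p)
        loads[p] += 1
    return node_parts

# Toy usage: in practice edge_iter would read a huge edge-list file line by
# line, and the writers would be per-partition output files.
writers = [io.StringIO(), io.StringIO()]
stream_partition(iter([(0, 1), (1, 2), (2, 0), (2, 3)]), writers)
print(writers[0].getvalue(), writers[1].getvalue())
```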
arXiv Detail & Related papers (2024-04-02T20:55:39Z)
- Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training [6.557328947642343]
Distributed full-graph training of Graph Neural Networks (GNNs) over large graphs is bandwidth-demanding and time-consuming.
This paper proposes an efficient GNN training system, AdaQP, to expedite distributed full-graph training.
arXiv Detail & Related papers (2023-06-02T09:02:09Z)
- Scalable Graph Convolutional Network Training on Distributed-Memory Systems [5.169989177779801]
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges.
We propose a highly parallel training algorithm that scales to large processor counts.
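The summary does not give the authors' parallel algorithm; as a rough single-process simulation of one common distributed-memory layout (an assumption, not their method), each worker can own a block of rows of the normalized adjacency matrix and compute its rows of the next layer after gathering the feature matrix:

```python
import numpy as np

def simulated_parallel_gcn_layer(A, H, W, num_workers):
    """Each 'worker' owns a row block of A and the matching rows of the output.
    In a real distributed-memory code, H would be obtained by an allgather over
    the workers instead of being available locally."""
    n = A.shape[0]
    row_blocks = np.array_split(np.arange(n), num_workers)
    outputs = []
    for rows in row_blocks:
        local_A = A[rows]                    # this worker's rows of A
        outputs.append(local_A @ H @ W)      # local SpMM + dense GEMM
    return np.concatenate(outputs, axis=0)

# Toy check against the single-machine computation.
rng = np.random.default_rng(0)
n, d_in, d_out = 8, 5, 3
A = (rng.random((n, n)) < 0.3).astype(np.float32)
H = rng.standard_normal((n, d_in)).astype(np.float32)
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
assert np.allclose(simulated_parallel_gcn_layer(A, H, W, 4), A @ H @ W, atol=1e-5)
```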
arXiv Detail & Related papers (2022-12-09T17:51:13Z)
- Training Graph Neural Networks on Growing Stochastic Graphs [114.75710379125412]
Graph Neural Networks (GNNs) rely on graph convolutions to exploit meaningful patterns in networked data.
We propose to learn GNNs on very large graphs by leveraging the limit object of a sequence of growing graphs, the graphon.
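For context, a graphon is a symmetric function W: [0,1]^2 -> [0,1], and finite graphs of any size can be sampled from it, which is what makes it a useful limit object for training. A minimal sampler (illustrative only, not the authors' training procedure; the example graphon is made up):

```python
import numpy as np

def sample_from_graphon(W, n, rng):
    """Draw n latent positions uniformly from [0,1] and connect each pair
    (i, j) independently with probability W(x_i, x_j)."""
    x = rng.random(n)
    P = W(x[:, None], x[None, :])          # pairwise connection probabilities
    A = (rng.random((n, n)) < P).astype(np.int8)
    A = np.triu(A, k=1)                    # keep the upper triangle ...
    return A + A.T                         # ... and symmetrize (undirected, no self-loops)

# Example graphon: W(x, y) = exp(-3 * max(x, y)) gives a wide spread of expected degrees.
rng = np.random.default_rng(0)
graphon = lambda x, y: np.exp(-3 * np.maximum(x, y))
A_small = sample_from_graphon(graphon, 100, rng)
A_large = sample_from_graphon(graphon, 1000, rng)
```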
arXiv Detail & Related papers (2022-10-27T16:00:45Z)
- Neural Graph Matching for Pre-training Graph Neural Networks [72.32801428070749]
Graph neural networks (GNNs) have shown a powerful capacity for modeling structural data.
We present a novel Graph Matching based GNN Pre-Training framework, called GMPT.
The proposed method can be applied to fully self-supervised pre-training and coarse-grained supervised pre-training.
arXiv Detail & Related papers (2022-03-03T09:53:53Z)
- Increase and Conquer: Training Graph Neural Networks on Growing Graphs [116.03137405192356]
We consider the problem of learning a graphon neural network (WNN) by training GNNs on graphs Bernoulli-sampled from the graphon.
Inspired by these results, we propose an algorithm to learn GNNs on large-scale graphs that, starting from a moderate number of nodes, successively increases the size of the graph during training.
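A sketch of that growing-graph schedule, assuming graphs of increasing size are drawn from the same underlying model and that the GNN's weights are independent of graph size so each stage can warm-start from the previous one; the sampler and training step below are hypothetical placeholders:

```python
def train_on_growing_graphs(sample_graph, init_params, train_epochs, sizes):
    """Train on a sequence of progressively larger graphs, carrying the
    parameters over from one stage to the next. `sample_graph(n)` returns a
    graph with n nodes; `train_epochs(params, graph)` runs a few epochs on it
    and returns updated parameters."""
    params = init_params
    for n in sizes:                      # e.g. sizes = [1_000, 10_000, 100_000]
        graph = sample_graph(n)
        params = train_epochs(params, graph)
    return params

# Toy usage with dummy stand-ins for the sampler and the training step.
params = train_on_growing_graphs(
    sample_graph=lambda n: {"num_nodes": n},
    init_params={"W": 0.0},
    train_epochs=lambda p, g: {"W": p["W"] + 1.0 / g["num_nodes"]},
    sizes=[100, 1_000, 10_000],
)
```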
arXiv Detail & Related papers (2021-06-07T15:05:59Z)
- Distributed Training of Graph Convolutional Networks using Subgraph Approximation [72.89940126490715]
We propose a training strategy that mitigates the lost information across multiple partitions of a graph through a subgraph approximation scheme.
The subgraph approximation approach helps the distributed training system converge at single-machine accuracy.
arXiv Detail & Related papers (2020-12-09T09:23:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all of the above) and is not responsible for any consequences of its use.