GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs
on Large Clusters
- URL: http://arxiv.org/abs/2311.06837v1
- Date: Sun, 12 Nov 2023 13:30:31 GMT
- Title: GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs
on Large Clusters
- Authors: Jaeyong Song, Hongsun Jang, Jaewon Jung, Youngsok Kim, Jinho Lee
- Abstract summary: Graph neural networks (GNNs) are one of the most rapidly growing fields within deep learning.
GraNNDis is an efficient distributed GNN training framework for training GNNs on large graphs and deep layers.
GraNNDis provides superior speedup over the state-of-the-art distributed GNN training frameworks.
- Score: 8.137466511979586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graph neural networks (GNNs) are one of the most rapidly growing fields
within deep learning. As the datasets and models used for GNNs grow, it becomes
nearly impossible to keep the whole network in GPU memory. Among numerous
attempts, distributed training is a popular approach to address this problem.
However, due to the nature of GNNs, existing distributed approaches suffer from
poor scalability, mainly because of slow external-server communication.
In this paper, we propose GraNNDis, an efficient distributed GNN training
framework for training GNNs on large graphs and deep layers. GraNNDis
introduces three new techniques. First, shared preloading provides a training
structure for a cluster of multi-GPU servers: essential vertex dependencies are
preloaded server-wise to reduce communication over the low-bandwidth links
between servers. Second, we present expansion-aware sampling. Because shared
preloading alone is limited by the neighbor explosion, expansion-aware sampling
further reduces the vertex dependencies that span server boundaries. Third, we
propose cooperative batching to create a unified framework for full-graph and
mini-batch training, which significantly reduces redundant memory usage in
mini-batch training. Through this unification, GraNNDis enables a reasonable
trade-off between full-graph and mini-batch training, especially when the
entire graph does not fit into GPU memory.
With experiments conducted on a multi-server/multi-GPU cluster, we show that
GraNNDis provides superior speedup over the state-of-the-art distributed GNN
training frameworks.
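The abstract describes shared preloading and expansion-aware sampling only at a
high level. The following is a minimal, self-contained sketch of how
server-boundary-aware dependency collection could look, assuming a plain
adjacency-list graph; the names (adj, local_vertices, remote_fanout) and the
exact sampling rule are illustrative assumptions, not GraNNDis's actual API or
algorithm.

```python
# Illustrative sketch only, not the GraNNDis implementation.
import random

def expansion_aware_frontier(adj, local_vertices, num_layers, remote_fanout, seed=0):
    """Collect the vertex set an L-layer GNN needs to compute outputs for
    `local_vertices` on one server. In the spirit of shared preloading, this
    set would be fetched once per server and shared by all of its GPUs.
    Neighbors that live on other servers are sampled down to `remote_fanout`
    per vertex so that the dependency set does not explode across the slow
    server boundary (expansion-aware sampling); intra-server neighbors are
    always kept."""
    rng = random.Random(seed)
    local = set(local_vertices)
    needed = set(local)      # everything this server must hold in memory
    frontier = set(local)
    for _ in range(num_layers):
        next_frontier = set()
        for v in frontier:
            neighbors = adj.get(v, [])
            intra = [u for u in neighbors if u in local]
            remote = [u for u in neighbors if u not in local]
            if len(remote) > remote_fanout:
                remote = rng.sample(remote, remote_fanout)
            next_frontier.update(intra)
            next_frontier.update(remote)
        frontier = next_frontier - needed
        needed |= next_frontier
    return needed

# Toy usage: vertices 0-3 are local to this server, 4-7 live on other servers.
adj = {0: [1, 4, 5], 1: [0, 2, 6], 2: [1, 3, 7], 3: [2],
       4: [0], 5: [0], 6: [1], 7: [2]}
print(expansion_aware_frontier(adj, local_vertices=[0, 1, 2, 3],
                               num_layers=2, remote_fanout=1))
```

In a real system, the returned set would determine which remote vertex
features are preloaded into server-local memory before training begins.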
Related papers
- Distributed Training of Large Graph Neural Networks with Variable Communication Rates [71.7293735221656]
Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements.
Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs.
We introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model.
arXiv Detail & Related papers (2024-06-25T14:57:38Z)
- CATGNN: Cost-Efficient and Scalable Distributed Training for Graph Neural Networks [7.321893519281194]
Existing distributed systems load the entire graph in memory for graph partitioning.
We propose CATGNN, a cost-efficient and scalable distributed GNN training system.
We also propose a novel streaming partitioning algorithm named SPRING for distributed GNN training.
arXiv Detail & Related papers (2024-04-02T20:55:39Z) - Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
- Graph Ladling: Shockingly Simple Parallel GNN Training without Intermediate Communication [100.51884192970499]
GNNs are a powerful family of neural networks for learning over graphs.
Scaling GNNs by either deepening or widening suffers from the prevalent issues of unhealthy gradients, over-smoothing, and information squashing.
We propose not to deepen or widen current GNNs, but instead present a data-centric perspective of model soups tailored for GNNs.
arXiv Detail & Related papers (2023-06-18T03:33:46Z)
- You Can Have Better Graph Neural Networks by Not Training Weights at All: Finding Untrained GNNs Tickets [105.24703398193843]
Untrained subnetworks in graph neural networks (GNNs) still remain mysterious.
We show that the found untrained subnetworks can substantially mitigate the GNN over-smoothing problem.
We also observe that such sparse untrained subnetworks have appealing performance in out-of-distribution detection and robustness to input perturbations.
arXiv Detail & Related papers (2022-11-28T14:17:36Z)
- Distributed Graph Neural Network Training: A Survey [51.77035975191926]
Graph neural networks (GNNs) are a type of deep learning models that are trained on graphs and have been successfully applied in various domains.
Despite the effectiveness of GNNs, it is still challenging for GNNs to efficiently scale to large graphs.
As a remedy, distributed computing becomes a promising solution for training large-scale GNNs.
arXiv Detail & Related papers (2022-11-01T01:57:00Z)
- Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks [22.728439336309858]
We propose a communication-efficient distributed GNN training technique named Learn Locally, Correct Globally (LLCG).
In LLCG, each machine trains a GNN on its local data, ignoring the dependencies between nodes on different machines, and then sends the locally trained model to the server for periodic model averaging (a minimal sketch of this averaging step appears after this list).
We rigorously analyze the convergence of distributed methods with periodic model averaging for training GNNs and show that naively applying periodic model averaging but ignoring the dependency between nodes will suffer from an irreducible residual error.
arXiv Detail & Related papers (2021-11-16T03:07:01Z)
- SpreadGNN: Serverless Multi-task Federated Learning for Graph Neural Networks [13.965982814292971]
Graph Neural Networks (GNNs) are the first-choice methods for graph machine learning problems.
Centralizing a massive amount of real-world graph data for GNN training is prohibitive due to user-side privacy concerns.
This work proposes SpreadGNN, a novel multi-task federated training framework.
arXiv Detail & Related papers (2021-06-04T22:20:47Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
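The LLCG summary above describes local training combined with periodic model
averaging. Below is a minimal, generic sketch of that averaging step using
torch.distributed collectives; it is not the authors' code, it assumes the
process group has already been initialized, it omits LLCG's server-side global
correction, and the averaging period and loop structure are illustrative
assumptions.

```python
# Generic sketch of local training with periodic model averaging (LLCG-style).
# Assumes dist.init_process_group() has been called and each rank holds only
# its own graph partition's data. Not the authors' implementation.
import torch
import torch.distributed as dist

@torch.no_grad()
def average_model(model: torch.nn.Module) -> None:
    """Replace every parameter with its average across all ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        param.data /= world_size

def train_llcg_style(model, optimizer, local_batches, loss_fn, averaging_period=50):
    """Train only on local data (cross-machine edges ignored) and synchronize
    by averaging models every `averaging_period` steps instead of every step."""
    for step, (inputs, targets) in enumerate(local_batches):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if (step + 1) % averaging_period == 0:
            average_model(model)
```

Compared with per-step gradient synchronization, averaging only every few
steps trades some accuracy of the gradient signal for far fewer communication
rounds, which is the trade-off the LLCG convergence analysis quantifies.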