DistDGL: Distributed Graph Neural Network Training for Billion-Scale
Graphs
- URL: http://arxiv.org/abs/2010.05337v3
- Date: Mon, 2 Aug 2021 18:30:04 GMT
- Title: DistDGL: Distributed Graph Neural Network Training for Billion-Scale
Graphs
- Authors: Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song,
Quan Gan, Zheng Zhang, George Karypis
- Abstract summary: DistDGL is a system for training GNNs in a mini-batch fashion on a cluster of machines.
It is based on the Deep Graph Library (DGL), a popular GNN development framework.
Our results show that DistDGL achieves linear speedup without compromising model accuracy.
- Score: 22.63888380481248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph neural networks (GNN) have shown great success in learning from
graph-structured data. They are widely used in various applications, such as
recommendation, fraud detection, and search. In these domains, the graphs are
typically large, containing hundreds of millions of nodes and several billions
of edges. To tackle this challenge, we develop DistDGL, a system for training
GNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on the
Deep Graph Library (DGL), a popular GNN development framework. DistDGL
distributes the graph and its associated data (initial features and embeddings)
across the machines and uses this distribution to derive a computational
decomposition by following an owner-compute rule. DistDGL follows a synchronous
training approach and allows ego-networks forming the mini-batches to include
non-local nodes. To minimize the overheads associated with distributed
computations, DistDGL uses a high-quality and light-weight min-cut graph
partitioning algorithm along with multiple balancing constraints. This allows
it to reduce communication overheads and statically balance the computations.
It further reduces the communication by replicating halo nodes and by using
sparse embedding updates. The combination of these design choices allows
DistDGL to train high-quality models while achieving high parallel efficiency
and memory scalability. We demonstrate our optimizations on both inductive and
transductive GNN models. Our results show that DistDGL achieves linear speedup
without compromising model accuracy and requires only 13 seconds to complete a
training epoch for a graph with 100 million nodes and 3 billion edges on a
cluster with 16 machines. DistDGL is now publicly available as part of
DGL: https://github.com/dmlc/dgl/tree/master/python/dgl/distributed.
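As a rough illustration of the workflow the abstract describes (min-cut partitioning with balancing constraints, then synchronous mini-batch training against the partitioned graph), the sketch below uses DGL's public distributed API. This is a minimal sketch, not the paper's experimental setup: the tiny Cora dataset, the output path, and the 4-partition / 2-layer-sampling settings are placeholder choices, and exact class names (e.g. DistNodeDataLoader) have shifted across DGL releases.

```python
# Offline step: partition the graph with METIS-based min-cut partitioning.
# balance_ntypes/balance_edges add the balancing constraints mentioned in the
# abstract; num_hops=1 replicates a one-hop ring of halo nodes per partition.
import dgl

g = dgl.data.CoraGraphDataset()[0]  # placeholder for a billion-scale graph
dgl.distributed.partition_graph(
    g, graph_name='cora', num_parts=4, out_path='4part_data',
    part_method='metis',
    balance_ntypes=g.ndata['train_mask'],  # balance training nodes per part
    balance_edges=True,                    # balance edge counts per part
    num_hops=1)                            # replicate 1-hop halo nodes
```

Each trainer process then works on the nodes its machine owns (the owner-compute rule) and pulls any non-local features needed by the sampled ego-networks:

```python
# Per-trainer script, launched once per machine in the cluster.
import dgl
import torch as th

dgl.distributed.initialize('ip_config.txt')        # connect to graph servers
th.distributed.init_process_group(backend='gloo')  # gradient synchronization
g = dgl.distributed.DistGraph('cora', part_config='4part_data/cora.json')

# Each trainer gets a disjoint slice of the training nodes.
train_nids = dgl.distributed.node_split(
    g.ndata['train_mask'], g.get_partition_book(), force_even=True)

# Neighbor sampling builds mini-batch ego-networks that may reach non-local nodes.
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

for input_nodes, seeds, blocks in dataloader:
    feats = g.ndata['feat'][input_nodes]   # fetches remote features on demand
    labels = g.ndata['label'][seeds]
    # forward/backward on a DistributedDataParallel-wrapped GNN goes here;
    # synchronous gradient updates keep all trainers in lockstep.
```

Learnable node embeddings would additionally go through DGL's distributed sparse embedding machinery, so that only the embeddings touched by a mini-batch are updated; that is the sparse-update optimization the abstract refers to.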
Related papers
- FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale [29.272368697268433]
Graph Neural Networks (GNNs) have shown great superiority on non-Euclidean graph data.
We propose FastGL, a GPU-efficient Framework for accelerating sampling-based training of GNN at Large scale.
FastGL can achieve an average speedup of 11.8x, 2.2x and 1.5x over the state-of-the-art frameworks PyG, DGL, and GNNLab, respectively.
arXiv Detail & Related papers (2024-09-23T11:45:47Z)
- GLISP: A Scalable GNN Learning System by Exploiting Inherent Structural Properties of Graphs [5.410321469222541]
We propose GLISP, a sampling-based GNN learning system for industrial-scale graphs.
GLISP consists of three core components: a graph partitioner, a graph sampling service, and a graph inference engine.
Experiments show that GLISP achieves up to $6.53\times$ and $70.77\times$ speedups over existing GNN systems for training and inference tasks.
arXiv Detail & Related papers (2024-01-06T02:59:24Z)
- Graph Transformers for Large Graphs [57.19338459218758]
This work advances representation learning on single large-scale graphs with a focus on identifying model characteristics and critical design constraints.
A key innovation of this work lies in the creation of a fast neighborhood sampling technique coupled with a local attention mechanism.
We report a 3x speedup and 16.8% performance gain on ogbn-products and snap-patents, while we also scale LargeGT on ogbn-100M with a 5.9% performance improvement.
arXiv Detail & Related papers (2023-12-18T11:19:23Z)
- Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
- DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training [18.52206409432894]
DistTGL is an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
arXiv Detail & Related papers (2023-07-14T22:52:27Z)
- Training Graph Neural Networks on Growing Stochastic Graphs [114.75710379125412]
Graph Neural Networks (GNNs) rely on graph convolutions to exploit meaningful patterns in networked data.
We propose to learn GNNs on very large graphs by leveraging the limit object of a sequence of growing graphs, the graphon.
arXiv Detail & Related papers (2022-10-27T16:00:45Z)
- TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs [17.264420882897017]
We propose TGL, a unified framework for large-scale offline Temporal Graph Neural Network training.
TGL comprises five main components: a temporal sampler, a mailbox, a node memory module, a memory updater, and a message passing engine.
We introduce two large-scale real-world datasets with 0.2 and 1.3 billion temporal edges.
arXiv Detail & Related papers (2022-03-28T16:41:18Z)
- BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing [0.0]
Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data.
Existing systems are inefficient at training large graphs with billions of nodes and edges on GPUs.
This paper proposes BGL, a distributed GNN training system designed to address the bottlenecks with a few key ideas.
arXiv Detail & Related papers (2021-12-16T00:37:37Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- A Unified Lottery Ticket Hypothesis for Graph Neural Networks [82.31087406264437]
We present a unified GNN sparsification (UGS) framework that simultaneously prunes the graph adjacency matrix and the model weights.
We further generalize the popular lottery ticket hypothesis to GNNs for the first time, by defining a graph lottery ticket (GLT) as a pair of core sub-dataset and sparse sub-network.
arXiv Detail & Related papers (2021-02-12T21:52:43Z)
- Iterative Deep Graph Learning for Graph Neural Networks: Better and Robust Node Embeddings [53.58077686470096]
We propose an end-to-end graph learning framework, namely Iterative Deep Graph Learning (IDGL) for jointly and iteratively learning graph structure and graph embedding.
Our experiments show that our proposed IDGL models can consistently outperform or match the state-of-the-art baselines.
arXiv Detail & Related papers (2020-06-21T19:49:15Z)