Boosting Distributed Full-graph GNN Training with Asynchronous One-bit
Communication
- URL: http://arxiv.org/abs/2303.01277v1
- Date: Thu, 2 Mar 2023 14:02:39 GMT
- Title: Boosting Distributed Full-graph GNN Training with Asynchronous One-bit
Communication
- Authors: Meng Zhang, Qinghao Hu, Peng Sun, Yonggang Wen, Tianwei Zhang
- Abstract summary: Training Graph Neural Networks (GNNs) on large graphs is challenging due to the conflict between the high memory demand and limited GPU memory.
We propose Sylvie, an efficient distributed GNN training framework that employs a one-bit quantization technique in GNNs.
In detail, Sylvie provides a lightweight Low-bit Module to quantize the sent data and dequantize the received data back to full precision values in each layer.
- Score: 23.883543151975136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training Graph Neural Networks (GNNs) on large graphs is challenging due to
the conflict between the high memory demand and limited GPU memory. Recently,
distributed full-graph GNN training has been widely adopted to tackle this
problem. However, the substantial inter-GPU communication overhead can cause
severe throughput degradation. Existing communication compression techniques
mainly focus on traditional DNN training, whose bottleneck lies in
synchronizing gradients and parameters. We find they do not work well in
distributed GNN training, where the bottleneck is instead the layer-wise
communication of features during the forward pass and of feature gradients
during the backward pass. To this end, we propose Sylvie, an efficient
distributed GNN training framework that employs a one-bit quantization
technique in GNNs and further pipelines the reduced communication with
computation to drastically shrink the overhead while maintaining model
quality. In detail, Sylvie provides a lightweight
Low-bit Module to quantize the sent data and dequantize the received data back
to full precision values in each layer. Additionally, we propose a Bounded
Staleness Adaptor to bound the introduced staleness for further
performance enhancement. We conduct theoretical convergence analysis and
extensive experiments on various models and datasets to demonstrate that Sylvie can
considerably boost the training throughput by up to 28.1x.
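For intuition, here is a minimal, hand-rolled sketch of the kind of one-bit feature compression the abstract describes: the data sent between GPUs is reduced to sign bits plus a scale, then dequantized back to full precision on the receiving side. This is not Sylvie's actual Low-bit Module; the per-node scaling and the NumPy bit-packing below are illustrative assumptions.

```python
# Minimal sketch (not Sylvie's Low-bit Module): compress node features to sign
# bits plus one full-precision scale per node before sending, and recover a
# full-precision approximation on the receiving side.
import numpy as np
import torch

def one_bit_quantize(h: torch.Tensor):
    """Return (packed sign bits, per-node scale, original shape)."""
    scale = h.abs().mean(dim=1, keepdim=True)      # one fp32 scale per node
    signs = (h >= 0).cpu().numpy()                 # boolean sign matrix
    packed = np.packbits(signs, axis=1)            # 1 bit per feature on the wire
    return packed, scale, tuple(h.shape)

def one_bit_dequantize(packed, scale, shape, device="cpu"):
    """Approximate the original features as sign(h) * per-node scale."""
    bits = np.unpackbits(packed, axis=1, count=shape[1]).astype(np.float32)
    signs = torch.from_numpy(bits).to(device) * 2.0 - 1.0   # {0,1} -> {-1,+1}
    return signs * scale.to(device)

# Toy usage: roughly 32x less traffic per feature value (plus one scale per
# node), at the cost of a quantization error the training loop must tolerate.
h = torch.randn(4, 16)
h_hat = one_bit_dequantize(*one_bit_quantize(h))
print(h.shape, h_hat.shape)
```

In a real system the packed bits and scales would be what is exchanged between GPUs and overlapped with computation; the asynchrony and the Bounded Staleness Adaptor are beyond this sketch.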
Related papers
- Distributed Training of Large Graph Neural Networks with Variable Communication Rates [71.7293735221656]
Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements.
Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs.
We introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model.
arXiv Detail & Related papers (2024-06-25T14:57:38Z)
- Label Deconvolution for Node Representation Learning on Large-scale Attributed Graphs against Learning Bias [75.44877675117749]
We propose an efficient label regularization technique, Label Deconvolution (LD), which alleviates the learning bias via a novel and highly scalable approximation to the inverse mapping of GNNs.
Experiments demonstrate that LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.
arXiv Detail & Related papers (2023-09-26T13:09:43Z)
- Staleness-Alleviated Distributed GNN Training via Online Dynamic-Embedding Prediction [13.575053193557697]
This paper proposes SAT (Staleness-Alleviated Training), a novel and scalable distributed GNN training framework.
The key idea of SAT is to model the GNN's embedding evolution as a temporal graph and build a model upon it to predict future embeddings.
Empirically, we demonstrate that SAT can effectively reduce embedding staleness and thus achieve better performance and convergence speed.
arXiv Detail & Related papers (2023-08-25T16:10:44Z)
- Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a distributed GNN training framework that significantly speeds up training by eliminating inter-machine communication via vertex-cut partitioning.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z)
- Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training [6.557328947642343]
Distributed full-graph training of Graph Neural Networks (GNNs) over large graphs is bandwidth-demanding and time-consuming.
This paper proposes an efficient GNN training system, AdaQP, to expedite distributed full-graph training.
arXiv Detail & Related papers (2023-06-02T09:02:09Z)
- Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe training inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
- A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking [124.21408098724551]
Large-scale graph training is a notoriously challenging problem for graph neural networks (GNNs).
We present a new ensembling training scheme, EnGCN, to address these issues.
Our proposed method has achieved new state-of-the-art (SOTA) performance on large-scale datasets.
arXiv Detail & Related papers (2022-10-14T03:43:05Z)
- Distributed Graph Neural Network Training with Periodic Historical Embedding Synchronization [9.503080586294406]
Graph Neural Networks (GNNs) are prevalent in various applications such as social networks, recommender systems, and knowledge graphs.
Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs graph integrity and model performance.
This paper proposes DIstributed Graph Embedding SynchronizaTion (DIGEST), a novel distributed GNN training framework.
arXiv Detail & Related papers (2022-05-31T18:44:53Z)
- CAP: Co-Adversarial Perturbation on Weights and Features for Improving Generalization of Graph Neural Networks [59.692017490560275]
Adversarial training has been widely demonstrated to improve a model's robustness against adversarial attacks.
It remains unclear how adversarial training could improve the generalization abilities of GNNs on graph analytics problems.
We formulate the co-adversarial perturbation (CAP) optimization problem over both weights and features, and design an alternating adversarial perturbation algorithm that flattens the weight and feature loss landscapes in turn.
arXiv Detail & Related papers (2021-10-28T02:28:13Z)
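As a rough illustration of the alternating weight/feature perturbation idea described in the CAP entry above, the sketch below applies one sign-gradient perturbation to the input features and one SAM-style perturbation to the weights within a single training step. It is not the CAP authors' algorithm; the single-step perturbations, the step sizes, and the omission of graph structure from the model call are simplifying assumptions.

```python
# Rough sketch of alternating adversarial perturbation on features and weights
# (NOT the CAP authors' code; step sizes and single-step perturbations are
# simplifying assumptions, and the graph inputs are abstracted into `x`).
import torch
import torch.nn as nn

def cap_like_step(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                  loss_fn, opt: torch.optim.Optimizer,
                  rho_x: float = 0.01, rho_w: float = 0.05):
    # (1) Feature perturbation: one sign-gradient ascent step on the inputs.
    x_adv = x.detach().clone().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + rho_x * x_adv.grad.sign()).detach()

    # (2) Weight perturbation: ascend the loss in weight space (SAM-style).
    loss = loss_fn(model(x_adv), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho_w * g.sign())

    # (3) Take the gradient at the perturbed point, undo the weight
    #     perturbation, then update the original weights.
    opt.zero_grad()
    loss_fn(model(x_adv), y).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho_w * g.sign())
    opt.step()
```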
This list is automatically generated from the titles and abstracts of the papers on this site.