Simplifying Distributed Neural Network Training on Massive Graphs:
Randomized Partitions Improve Model Aggregation
- URL: http://arxiv.org/abs/2305.09887v1
- Date: Wed, 17 May 2023 01:49:44 GMT
- Title: Simplifying Distributed Neural Network Training on Massive Graphs:
Randomized Partitions Improve Model Aggregation
- Authors: Jiong Zhu, Aishwarya Reganti, Edward Huang, Charles Dickens, Nikhil
Rao, Karthik Subbian, Danai Koutra
- Abstract summary: We present a simplified framework for distributed GNN training that does not rely on the aforementioned costly operations.
Specifically, our framework assembles independent trainers, each of which asynchronously learns a local model on locally-available parts of the training graph.
In experiments on social and e-commerce networks with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches achieve state-of-the-art performance and 2.31x speedup compared to the fastest baseline.
- Score: 23.018715954992352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed training of GNNs enables learning on massive graphs (e.g., social
and e-commerce networks) that exceed the storage and computational capacity of
a single machine. To reach performance comparable to centralized training,
distributed frameworks focus on maximally recovering cross-instance node
dependencies with either communication across instances or periodic fallback to
centralized training, both of which create overhead and limit framework
scalability. In this work, we present a simplified framework for distributed
GNN training that does not rely on the aforementioned costly operations, and
has improved scalability, convergence speed and performance over the
state-of-the-art approaches. Specifically, our framework (1) assembles
independent trainers, each of which asynchronously learns a local model on
locally-available parts of the training graph, and (2) only conducts periodic
(time-based) model aggregation to synchronize the local models. Backed by our
theoretical analysis, instead of maximizing the recovery of cross-instance node
dependencies (which has been considered the key to closing the performance
gap between model aggregation and centralized training), our
framework leverages randomized assignment of nodes or super-nodes (i.e.,
collections of original nodes) to partition the training graph such that it
improves data uniformity and minimizes the discrepancy of gradient and loss
function across instances. In our experiments on social and e-commerce networks
with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches --
despite using less training data -- achieve state-of-the-art performance and
2.31x speedup compared to the fastest baseline, and show better robustness to
trainer failures.
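The abstract above spells out the full recipe: randomly assign nodes or super-nodes to trainers, let each trainer fit a local model on its own partition, and synchronize only by periodically averaging parameters on a time interval. The sketch below is a single-process illustration of that control flow under stated assumptions, not the authors' implementation; the helper names (random_partition, average_models, train_distributed, local_step) and the use of PyTorch are assumptions.

# Minimal single-process sketch (an assumption, not the authors' code) of the
# workflow described in the abstract: random (super-)node-to-trainer assignment,
# independent local training, and time-based model aggregation by parameter
# averaging. All helper names are hypothetical.
import copy
import random
import time

import torch


def random_partition(node_ids, num_trainers, seed=0):
    # Randomly assign nodes (or super-nodes) to trainers to improve data
    # uniformity across instances.
    rng = random.Random(seed)
    parts = [[] for _ in range(num_trainers)]
    for node in node_ids:
        parts[rng.randrange(num_trainers)].append(node)
    return parts


def average_models(models):
    # Time-based model aggregation: element-wise average of the floating-point
    # parameters of all local models.
    avg = copy.deepcopy(models[0].state_dict())
    for key, value in avg.items():
        if value.is_floating_point():
            avg[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    return avg


def train_distributed(make_model, partitions, local_step, total_seconds, sync_every):
    # Each trainer learns a local model on its own partition; the local models
    # are synchronized every `sync_every` seconds rather than every k steps.
    # Real trainers would run asynchronously on separate machines; this loop
    # only illustrates the control flow.
    models = [make_model() for _ in partitions]
    start = last_sync = time.monotonic()
    while time.monotonic() - start < total_seconds:
        for model, part in zip(models, partitions):
            local_step(model, part)           # one local update on local data
        if time.monotonic() - last_sync >= sync_every:
            shared = average_models(models)   # periodic (time-based) aggregation
            for model in models:
                model.load_state_dict(shared)
            last_sync = time.monotonic()
    return models[0]

Here make_model, local_step, and the timing constants stand in for whatever GNN architecture, mini-batch sampler, and schedule a real deployment would use; only the randomized partitioning and time-based averaging mirror what the abstract describes.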
Related papers
- From promise to practice: realizing high-performance decentralized training [8.955918346078935]
Decentralized training of deep neural networks has attracted significant attention for its theoretically superior scalability over synchronous data-parallel methods like All-Reduce.
This paper identifies three key factors that can lead to speedups over All-Reduce training and constructs a runtime model to determine when, how, and to what degree decentralization can yield shorter per-iteration runtimes.
arXiv Detail & Related papers (2024-10-15T19:04:56Z) - Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block-structured optimization problem with delayed updates; a toy sketch of delayed-update asynchronous SGD appears at the end of the related papers list.
arXiv Detail & Related papers (2024-01-03T13:07:07Z) - Entropy Aware Training for Fast and Accurate Distributed GNN [0.0]
Several distributed frameworks have been developed to scale Graph Neural Networks (GNNs) on billion-size graphs.
We develop techniques that reduce training time and improve accuracy.
We implement our algorithms on the DistDGL framework and observe that our training techniques scale much better than the existing training approach.
arXiv Detail & Related papers (2023-11-04T13:11:49Z) - Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by using a vertex cut of the graph to make training communication-free (a toy vertex-cut partitioner is sketched at the end of the related papers list).
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - Self-Ensembling GAN for Cross-Domain Semantic Segmentation [107.27377745720243]
This paper proposes a self-ensembling generative adversarial network (SE-GAN) exploiting cross-domain data for semantic segmentation.
In SE-GAN, a teacher network and a student network constitute a self-ensembling model for generating semantic segmentation maps, which together with a discriminator, forms a GAN.
Despite its simplicity, we find SE-GAN can significantly boost the performance of adversarial training and enhance the stability of the model.
arXiv Detail & Related papers (2021-12-15T09:50:25Z) - Learn Locally, Correct Globally: A Distributed Algorithm for Training
Graph Neural Networks [22.728439336309858]
We propose a communication-efficient distributed GNN training technique named Learn Locally, Correct Globally (LLCG).
LLCG trains a GNN on its local data by ignoring the dependency between nodes among different machines, then sends the locally trained model to the server for periodic model averaging.
We rigorously analyze the convergence of distributed methods with periodic model averaging for training GNNs and show that naively applying periodic model averaging while ignoring the dependency between nodes suffers from an irreducible residual error.
arXiv Detail & Related papers (2021-11-16T03:07:01Z) - DANCE: DAta-Network Co-optimization for Efficient Segmentation Model
Training and Inference [85.02494022662505]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference.
It integrates automated data slimming which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss guided by the images' spatial complexity.
Experiments and ablation studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z) - Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model that performs better than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z) - Clustered Federated Learning via Generalized Total Variation
Minimization [83.26141667853057]
We study optimization methods to train local (or personalized) models for local datasets with a decentralized network structure.
Our main conceptual contribution is to formulate federated learning as generalized total variation (GTV) minimization.
Our main algorithmic contribution is a fully decentralized federated learning algorithm.
arXiv Detail & Related papers (2021-05-26T18:07:19Z) - Decentralized Statistical Inference with Unrolled Graph Neural Networks [26.025935320024665]
We propose a learning-based framework, which unrolls decentralized optimization algorithms into graph neural networks (GNNs).
By minimizing the recovery error via end-to-end training, this learning-based framework resolves the model mismatch issue.
Our convergence analysis reveals that the learned model parameters may accelerate the convergence and reduce the recovery error to a large extent.
arXiv Detail & Related papers (2021-04-04T07:52:34Z)
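As a companion to the Ravnest entry above, the toy sketch below shows the generic pattern of asynchronous SGD with delayed (stale) updates: gradients are applied several iterations after they were computed. It illustrates the general idea only, not Ravnest's algorithm; the function name, staleness model, and quadratic test problem are assumptions.

# Toy sketch of asynchronous SGD with delayed updates (not Ravnest's code):
# each gradient is applied `delay` iterations after it was computed, mimicking
# workers whose updates reach the parameter owner late.
from collections import deque

import numpy as np


def async_sgd_with_delay(grad_fn, w0, lr=0.05, delay=3, steps=200):
    w = np.asarray(w0, dtype=float)
    pending = deque()                       # gradients still "in flight"
    for _ in range(steps):
        pending.append(grad_fn(w))          # gradient computed on current params
        if len(pending) > delay:
            w = w - lr * pending.popleft()  # apply a gradient from `delay` steps ago
    while pending:                          # flush the remaining stale updates
        w = w - lr * pending.popleft()
    return w


# Example: minimize f(w) = ||w||^2 (gradient 2w) despite stale gradients.
print(async_sgd_with_delay(lambda w: 2.0 * w, w0=[5.0, -3.0]))  # approx. [0, 0]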
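For the CoFree-GNN entry above, here is a toy vertex-cut partitioner: edges are distributed across partitions and the endpoints of cut edges are replicated, so each trainer can process its edge set locally. This is a generic illustration of vertex-cut partitioning, not CoFree-GNN's algorithm; the hashing scheme and names are assumptions.

# Toy vertex-cut partitioner (generic, not CoFree-GNN's method): edges are
# hashed to partitions, and any vertex touched by edges in several partitions
# is replicated in each of them.
from collections import defaultdict


def vertex_cut_partition(edges, num_parts):
    edge_parts = [[] for _ in range(num_parts)]       # edges owned by each partition
    vertex_copies = defaultdict(set)                  # partition ids holding each vertex
    for u, v in edges:
        p = hash((min(u, v), max(u, v))) % num_parts  # deterministic edge-to-partition hash
        edge_parts[p].append((u, v))
        vertex_copies[u].add(p)
        vertex_copies[v].add(p)
    replication = sum(len(s) for s in vertex_copies.values()) / max(len(vertex_copies), 1)
    return edge_parts, vertex_copies, replication     # replication = avg copies per vertex


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
parts, copies, rep = vertex_cut_partition(edges, num_parts=2)
print(rep)  # average number of vertex copies across the two partitions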