Expediting Distributed DNN Training with Device Topology-Aware Graph
Deployment
- URL: http://arxiv.org/abs/2302.06126v1
- Date: Mon, 13 Feb 2023 06:30:24 GMT
- Title: Expediting Distributed DNN Training with Device Topology-Aware Graph
Deployment
- Authors: Shiwei Zhang, Xiaodong Yi, Lansong Diao, Chuan Wu, Siyu Wang, and Wei
Lin
- Abstract summary: TAG is an automatic system to derive optimized DNN training graph and its deployment onto any device topology.
We show that it can achieve up to 4.56x training speed-up as compared to existing schemes.
- Score: 18.021259939659874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents TAG, an automatic system to derive optimized DNN training
graph and its deployment onto any device topology, for expedited training in
device- and topology- heterogeneous ML clusters. We novelly combine both the
DNN computation graph and the device topology graph as input to a graph neural
network (GNN), and join the GNN with a search-based method to quickly identify
optimized distributed training strategies. To reduce communication in a
heterogeneous cluster, we further explore a lossless gradient compression
technique and solve a combinatorial optimization problem to automatically apply
the technique for training time minimization. We evaluate TAG with various
representative DNN models and device topologies, showing that it can achieve up
to 4.56x training speed-up as compared to existing schemes. TAG can produce
efficient deployment strategies for both unseen DNN models and unseen device
topologies, without heavy fine-tuning.
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Graph Coordinates and Conventional Neural Networks -- An Alternative for
Graph Neural Networks [0.10923877073891444]
We propose Topology Coordinate Neural Network (TCNN) and Directional Virtual Coordinate Neural Network (DVCNN) as novel alternatives to message passing GNNs.
TCNN and DVCNN achieve competitive or superior performance to message passing GNNs.
Our work expands the toolbox of techniques for graph-based machine learning.
arXiv Detail & Related papers (2023-12-03T10:14:10Z) - Efficient Heterogeneous Graph Learning via Random Projection [58.4138636866903]
Heterogeneous Graph Neural Networks (HGNNs) are powerful tools for deep learning on heterogeneous graphs.
Recent pre-computation-based HGNNs use one-time message passing to transform a heterogeneous graph into regular-shaped tensors.
We propose a hybrid pre-computation-based HGNN, named Random Projection Heterogeneous Graph Neural Network (RpHGNN)
arXiv Detail & Related papers (2023-10-23T01:25:44Z) - Communication-Free Distributed GNN Training with Vertex Cut [63.22674903170953]
CoFree-GNN is a novel distributed GNN training framework that significantly speeds up the training process by implementing communication-free training.
We demonstrate that CoFree-GNN speeds up the GNN training process by up to 10 times over the existing state-of-the-art GNN training approaches.
arXiv Detail & Related papers (2023-08-06T21:04:58Z) - Learning Cooperative Beamforming with Edge-Update Empowered Graph Neural
Networks [29.23937571816269]
We propose an edge-graph-neural-network (Edge-GNN) to learn the cooperative beamforming on the graph edges.
The proposed Edge-GNN achieves higher sum rate with much shorter computation time than state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-23T02:05:06Z) - Distributed Graph Neural Network Training: A Survey [51.77035975191926]
Graph neural networks (GNNs) are a type of deep learning models that are trained on graphs and have been successfully applied in various domains.
Despite the effectiveness of GNNs, it is still challenging for GNNs to efficiently scale to large graphs.
As a remedy, distributed computing becomes a promising solution of training large-scale GNNs.
arXiv Detail & Related papers (2022-11-01T01:57:00Z) - GNN at the Edge: Cost-Efficient Graph Neural Network Processing over
Distributed Edge Servers [24.109721494781592]
Graph Neural Networks (GNNs) are still under exploration, presenting a stark disparity to its broad edge adoptions.
This paper studies the cost optimization for distributed GNN processing over a multi-tier heterogeneous edge network.
We show that our approach achieves superior performance over de facto baselines with more than 95.8% cost eduction in a fast convergence speed.
arXiv Detail & Related papers (2022-10-31T13:03:16Z) - Distributed Graph Neural Network Training with Periodic Historical
Embedding Synchronization [9.503080586294406]
Graph Neural Networks (GNNs) are prevalent in various applications such as social network, recommender systems, and knowledge graphs.
Traditional sampling-based methods accelerate GNN by dropping edges and nodes, which impairs the graph integrity and model performance.
This paper proposes DIstributed Graph Embedding SynchronizaTion (DIGEST), a novel distributed GNN training framework.
arXiv Detail & Related papers (2022-05-31T18:44:53Z) - Learn Locally, Correct Globally: A Distributed Algorithm for Training
Graph Neural Networks [22.728439336309858]
We propose a communication-efficient distributed GNN training technique named $textLearn Locally, Correct Globally$ (LLCG)
LLCG trains a GNN on its local data by ignoring the dependency between nodes among different machines, then sends the locally trained model to the server for periodic model averaging.
We rigorously analyze the convergence of distributed methods with periodic model averaging for training GNNs and show that naively applying periodic model averaging but ignoring the dependency between nodes will suffer from an irreducible residual error.
arXiv Detail & Related papers (2021-11-16T03:07:01Z) - Fast Learning of Graph Neural Networks with Guaranteed Generalizability:
One-hidden-layer Case [93.37576644429578]
Graph neural networks (GNNs) have made great progress recently on learning from graph-structured data in practice.
We provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems.
arXiv Detail & Related papers (2020-06-25T00:45:52Z) - Binarized Graph Neural Network [65.20589262811677]
We develop a binarized graph neural network to learn the binary representations of the nodes with binary network parameters.
Our proposed method can be seamlessly integrated into the existing GNN-based embedding approaches.
Experiments indicate that the proposed binarized graph neural network, namely BGN, is orders of magnitude more efficient in terms of both time and space.
arXiv Detail & Related papers (2020-04-19T09:43:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.