Cooperative Minibatching in Graph Neural Networks
- URL: http://arxiv.org/abs/2310.12403v3
- Date: Fri, 6 Sep 2024 20:14:17 GMT
- Title: Cooperative Minibatching in Graph Neural Networks
- Authors: Muhammed Fatih Balin, Dominique LaSalle, Ümit V. Çatalyürek
- Abstract summary: Training Graph Neural Networks (GNNs) requires significant computational resources, and the process is highly data-intensive.
One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling.
We show how to take advantage of the same phenomenon in serial execution by generating dependent consecutive minibatches.
- Score: 2.9904113489777826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large-scale Graph Neural Networks (GNNs) requires significant computational resources, and the process is highly data-intensive. One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling. GNNs have the unique property that items in a minibatch have overlapping data. However, the commonly implemented Independent Minibatching approach assigns each Processing Element (PE, i.e., cores and/or GPUs) its own minibatch to process, leading to duplicated computations and input data access across PEs. This amplifies the Neighborhood Explosion Phenomenon (NEP), which is the main bottleneck limiting scaling. To reduce the effects of NEP in the multi-PE setting, we propose a new approach called Cooperative Minibatching. Our approach capitalizes on the fact that the size of the sampled subgraph is a concave function of the batch size, leading to significant reductions in the amount of work as batch sizes increase. Hence, it is favorable for processors equipped with a fast interconnect to work on a large minibatch together as a single larger processor, instead of working on separate smaller minibatches, even though the global batch size is identical. We also show how to take advantage of the same phenomenon in serial execution by generating dependent consecutive minibatches. Our experimental evaluations show up to 4x bandwidth savings for fetching vertex embeddings by simply increasing this dependency, without harming model convergence. Combining our proposed approaches, we achieve up to a 64% speedup over Independent Minibatching on single-node multi-GPU systems, using the same resources.
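The concavity argument can be illustrated on a toy graph: because the sampled neighborhoods of seed vertices overlap, one large cooperative minibatch touches far fewer unique vertices than two independent minibatches of half the size. The following Python sketch (not the paper's implementation; the random graph, batch size, fanout, and hop count are all made up for illustration) counts sampled vertices under both schemes.

```python
import random

random.seed(0)

# Toy random graph: 10,000 vertices with 20 neighbors each (illustrative only).
NUM_VERTICES, DEGREE, FANOUT, HOPS = 10_000, 20, 10, 2
adj = {v: random.sample(range(NUM_VERTICES), DEGREE) for v in range(NUM_VERTICES)}

def sampled_vertices(seeds, fanout=FANOUT, hops=HOPS):
    """Layer-wise neighbor sampling; returns the set of vertices that must be fetched."""
    visited, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        nxt = set()
        for v in frontier:
            nxt.update(random.sample(adj[v], fanout))
        frontier = nxt - visited
        visited |= nxt
    return visited

seeds = random.sample(range(NUM_VERTICES), 2048)
half = len(seeds) // 2

# Independent Minibatching: each PE samples and processes its own half separately.
independent_work = sum(len(sampled_vertices(s)) for s in (seeds[:half], seeds[half:]))

# Cooperative Minibatching: the PEs sample and process the combined batch together.
cooperative_work = len(sampled_vertices(seeds))

print("independent (sum of per-PE sampled vertices):", independent_work)
print("cooperative (sampled vertices for the joint batch):", cooperative_work)
```

On this toy graph the cooperative count comes out well below the independent sum; that gap is the concavity the paper exploits across PEs, and that dependent consecutive minibatches exploit over time in serial execution.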
Related papers
- Slicing Input Features to Accelerate Deep Learning: A Case Study with Graph Neural Networks [0.24578723416255746]
This paper introduces SliceGCN, a feature-sliced distributed large-scale graph learning method.
It aims to avoid accuracy loss typically associated with mini-batch training and to reduce inter-GPU communication.
Experiments were conducted on six node classification datasets, yielding some interesting analytical results.
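As a rough sketch of the feature-slicing idea (a generic illustration, not SliceGCN's actual system; all sizes and names below are invented): when the node feature matrix is partitioned along the feature dimension, each device applies only its slice of the layer weights, and a single sum (an all-reduce in a real multi-GPU setup) of the small partial outputs recovers the full result.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, in_dim, out_dim, num_devices = 1_000, 256, 64, 4

X = rng.standard_normal((num_nodes, in_dim))  # node feature matrix
W = rng.standard_normal((in_dim, out_dim))    # dense weight of a GNN layer

# Partition the feature dimension (and the matching weight rows) across devices.
slices = np.array_split(np.arange(in_dim), num_devices)

# Each device computes a partial product from its feature slice only; summing
# the partial outputs recovers X @ W exactly.
partials = [X[:, s] @ W[s, :] for s in slices]
Y_sliced = np.sum(partials, axis=0)

assert np.allclose(Y_sliced, X @ W)
```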
arXiv Detail & Related papers (2024-08-21T10:18:41Z)
- Distributed Matrix-Based Sampling for Graph Neural Network Training [0.0]
We propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once.
When the input graph topology does not fit on a single device, our method distributes the graph and uses communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling.
In addition to new methods for sampling, we introduce a pipeline that uses our matrix-based bulk sampling approach to provide end-to-end training results.
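The core trick of expressing neighborhood selection as sparse matrix products can be sketched with SciPy (a simplified, unsampled, single-device version; the paper additionally samples neighbors, batches many minibatches into one SpGEMM, and distributes the multiplication with communication-avoiding algorithms):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
num_vertices, batch_size = 1_000, 32

# Random sparse adjacency matrix in CSR format, ~1% density (illustrative only).
A = sp.random(num_vertices, num_vertices, density=0.01, format="csr", random_state=0)
A.data[:] = 1.0

# Selection matrix Q: one row per seed vertex in the minibatch.
seeds = rng.choice(num_vertices, size=batch_size, replace=False)
Q = sp.csr_matrix(
    (np.ones(batch_size), (np.arange(batch_size), seeds)),
    shape=(batch_size, num_vertices),
)

# A single sparse-sparse product pulls out the adjacency rows of the whole
# minibatch: the column indices of Q @ A are the 1-hop neighbors of the seeds.
frontier = (Q @ A).tocsr()
one_hop_neighbors = np.unique(frontier.indices)
print(f"{batch_size} seeds -> {one_hop_neighbors.size} one-hop neighbors")
```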
arXiv Detail & Related papers (2023-11-06T06:40:43Z)
- GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism [6.3568605707961]
Mini-batch training is commonly used to train Graph neural networks (GNNs) on large graphs.
In this paper, we introduce a hybrid parallel mini-batch training paradigm called split parallelism.
We show that split parallelism outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P3$.
arXiv Detail & Related papers (2023-03-24T03:28:05Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
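As a hedged illustration of the random-feature idea, the sketch below uses classic random Fourier features for a Gaussian (RBF) kernel rather than the NNGP kernel the paper actually approximates; the point is only that an explicit random feature map turns kernel evaluations into cheap dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_features, sigma = 200, 10, 2048, 1.0

X = rng.standard_normal((n, d))

# Random Fourier features: exp(-||x - y||^2 / (2 sigma^2)) ~= z(x) . z(y)
W = rng.standard_normal((d, num_features)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
Z = np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
K_approx = Z @ Z.T

# Exact RBF kernel for comparison.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma ** 2))

print("max abs error of the random-feature approximation:", np.abs(K_approx - K_exact).max())
```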
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
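A minimal sketch of the hash-encoding idea (heavily simplified: 2D inputs, nearest-grid-vertex lookup instead of multilinear interpolation, made-up table sizes, and no training loop): each resolution level hashes grid coordinates into a small table of feature vectors, and the concatenated per-level features would feed the small MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

num_levels, table_size, feat_dim = 4, 2 ** 14, 2
resolutions = [16 * 2 ** level for level in range(num_levels)]  # 16, 32, 64, 128
tables = [1e-2 * rng.standard_normal((table_size, feat_dim))    # trainable in practice
          for _ in range(num_levels)]
PRIMES = np.array([1, 2654435761], dtype=np.int64)              # spatial-hash constants

def hash_encode(points):
    """points: (n, 2) array in [0, 1)^2 -> (n, num_levels * feat_dim) features."""
    feats = []
    for res, table in zip(resolutions, tables):
        cells = np.floor(points * res).astype(np.int64)         # nearest grid vertex
        idx = np.bitwise_xor.reduce(cells * PRIMES, axis=1) % table_size
        feats.append(table[idx])                                 # hash-table lookup
    return np.concatenate(feats, axis=1)

points = rng.random((5, 2))
print(hash_encode(points).shape)  # (5, 8): four levels, two features per level
```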
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
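The peak-memory argument behind patch-based inference can be made concrete with a framework-free toy (a single 3x3 "valid" convolution; MCUNetV2's scheduler handles real CNN stages and receptive-field overlap automatically): computing the output a few rows at a time means only a small input window plus one output patch are resident at once, instead of the whole intermediate activation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, K = 64, 64, 3                  # input size and kernel size
image = rng.standard_normal((H, W))
kernel = np.full((K, K), 1.0 / 9.0)  # simple 3x3 averaging filter

def conv2d_valid(x, k):
    """Naive 'valid' 2D cross-correlation used as the toy workload."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

# Patch-by-patch schedule: produce PATCH_ROWS output rows at a time. Each patch
# only needs a (PATCH_ROWS + K - 1)-row input window (its halo), so peak memory
# is roughly one window plus one patch instead of the full feature map.
PATCH_ROWS = 8
out_rows = H - K + 1
patches = []
for r in range(0, out_rows, PATCH_ROWS):
    rows = min(PATCH_ROWS, out_rows - r)
    window = image[r:r + rows + K - 1, :]
    patches.append(conv2d_valid(window, kernel))
patched = np.concatenate(patches, axis=0)

assert np.allclose(patched, conv2d_valid(image, kernel))
```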
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- Global Neighbor Sampling for Mixed CPU-GPU Training on Giant Graphs [26.074384252289384]
Graph neural networks (GNNs) are powerful tools for learning from graph data and are widely used in various applications.
Although a number of sampling-based methods have been proposed to enable mini-batch training on large graphs, they have not been proven to work on truly industry-scale graphs.
We propose Global Neighborhood Sampling, which aims at training GNNs on giant graphs, specifically for mixed CPU-GPU training.
arXiv Detail & Related papers (2021-06-11T03:30:25Z)
- Accurate, Efficient and Scalable Training of Graph Neural Networks [9.569918335816963]
Graph Neural Networks (GNNs) are powerful deep learning models to generate node embeddings on graphs.
It is still challenging to perform training in an efficient and scalable way.
We propose a novel parallel training framework that reduces training workload by orders of magnitude compared with state-of-the-art minibatch methods.
arXiv Detail & Related papers (2020-10-05T22:06:23Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, is often collected and stored on multiple heterogeneous devices, such as mobile devices, drones, and vehicles.
Due to communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data at a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
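For a concrete, much-simplified picture of the consensus-ADMM template this line of work builds on, the sketch below runs plain full-batch consensus ADMM for distributed least squares across a handful of agents (no coding, no stochastic mini-batches, no edge/fusion-center topology; every quantity is invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
num_agents, n_per_agent, dim, rho, iters = 5, 40, 8, 1.0, 100

# Each agent holds a private chunk of a shared linear regression problem.
x_true = rng.standard_normal(dim)
A = [rng.standard_normal((n_per_agent, dim)) for _ in range(num_agents)]
b = [Ai @ x_true + 0.01 * rng.standard_normal(n_per_agent) for Ai in A]

x = [np.zeros(dim) for _ in range(num_agents)]  # local primal variables
u = [np.zeros(dim) for _ in range(num_agents)]  # scaled dual variables
z = np.zeros(dim)                               # shared consensus variable

for _ in range(iters):
    # Local step: each agent solves its own ridge-regularized least squares.
    for i in range(num_agents):
        lhs = A[i].T @ A[i] + rho * np.eye(dim)
        rhs = A[i].T @ b[i] + rho * (z - u[i])
        x[i] = np.linalg.solve(lhs, rhs)
    # Consensus step: average the dual-corrected local solutions.
    z = np.mean([x[i] + u[i] for i in range(num_agents)], axis=0)
    # Dual step: penalize disagreement with the consensus.
    for i in range(num_agents):
        u[i] = u[i] + x[i] - z

print("distance of consensus solution from x_true:", np.linalg.norm(z - x_true))
```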
arXiv Detail & Related papers (2020-10-02T10:41:59Z)