Scalable Training of Trustworthy and Energy-Efficient Predictive Graph Foundation Models for Atomistic Materials Modeling: A Case Study with HydraGNN
- URL: http://arxiv.org/abs/2406.12909v4
- Date: Fri, 01 Nov 2024 17:09:52 GMT
- Title: Scalable Training of Trustworthy and Energy-Efficient Predictive Graph Foundation Models for Atomistic Materials Modeling: A Case Study with HydraGNN
- Authors: Massimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Pei Zhang, David Rogers, Jonghyun Bae, Khaled Z. Ibrahim, Ashwin M. Aji, Karl W. Schulz, Jorda Polo, Prasanna Balaprakash
- Abstract summary: We develop and train scalable, trustworthy, and energy-efficient predictive graph foundation models (GFMs) using HydraGNN.
HydraGNN expands the boundaries of graph neural network (GNN) computations in both training scale and data diversity.
Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures.
- Abstract: We present our work on developing and training scalable, trustworthy, and energy-efficient predictive graph foundation models (GFMs) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural network (GNN) computations in both training scale and data diversity. It abstracts over message passing algorithms, allowing both reproduction of and comparison across algorithmic innovations that define nearest-neighbor convolution in GNNs. This work discusses a series of optimizations that have allowed scaling up the GFMs training to tens of thousands of GPUs on datasets consisting of hundreds of millions of graphs. Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures, such as energy and atomic forces. Using over 154 million atomistic structures for training, we illustrate the performance of our approach along with the lessons learned on two state-of-the-art United States Department of Energy (US-DOE) supercomputers, namely the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge Leadership Computing Facility. The HydraGNN architecture enables the GFM to achieve near-linear strong scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier.
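The multi-headed MTL design described in the abstract lends itself to a compact sketch: a shared message-passing trunk feeds a graph-level head (energy) and a node-level head (forces), and the loss is a weighted sum over tasks. The sketch below is an illustration only, not HydraGNN's actual API; the mean-aggregation convolution, layer sizes, and loss weighting are all assumptions.

```python
import torch
import torch.nn as nn

class MeanConv(nn.Module):
    """Mean aggregation over incoming neighbors followed by a linear update."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(2 * dim, dim)

    def forward(self, x, edge_index):
        src, dst = edge_index                       # edges point src -> dst
        agg = x.new_zeros(x.shape).index_add_(0, dst, x[src])
        deg = x.new_zeros(x.size(0), 1).index_add_(
            0, dst, x.new_ones(dst.size(0), 1)).clamp(min=1.0)
        return torch.relu(self.lin(torch.cat([x, agg / deg], dim=-1)))

class MultiHeadGNN(nn.Module):
    """Shared trunk + one graph-level head (energy) and one node-level head (forces)."""
    def __init__(self, in_dim, hidden=64, layers=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden)
        self.convs = nn.ModuleList(MeanConv(hidden) for _ in range(layers))
        self.energy_head = nn.Linear(hidden, 1)     # graph-level property
        self.force_head = nn.Linear(hidden, 3)      # node-level property

    def forward(self, x, edge_index, batch):
        h = self.embed(x)
        for conv in self.convs:
            h = conv(h, edge_index)
        # Sum-pool node features per graph for the energy prediction.
        pooled = h.new_zeros(int(batch.max()) + 1, h.size(1)).index_add_(0, batch, h)
        return self.energy_head(pooled), self.force_head(h)

# One 5-atom structure; the MTL loss combines both heads' errors.
model = MultiHeadGNN(in_dim=8)
x = torch.randn(5, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
batch = torch.zeros(5, dtype=torch.long)
e_true, f_true = torch.randn(1, 1), torch.randn(5, 3)
energy, forces = model(x, edge_index, batch)
loss = nn.functional.mse_loss(energy, e_true) + 0.1 * nn.functional.mse_loss(forces, f_true)
loss.backward()
```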
Related papers
- Layer-wise training for self-supervised learning on graphs
End-to-end training of graph neural networks (GNNs) on large graphs presents several memory and computational challenges.
We propose Layer-wise Regularized Graph Infomax, an algorithm to train GNNs layer by layer in a self-supervised manner.
arXiv Detail & Related papers (2023-09-04T10:23:39Z)
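The layer-wise idea can be sketched simply: each GNN layer is trained in turn against a local self-supervised objective while the layers below stay frozen, so memory never grows with depth. The sketch below is a loose illustration under assumptions of my own: a feature-reconstruction loss stands in for the paper's regularized Infomax objective, and the aggregation is a plain neighbor mean.

```python
import torch
import torch.nn as nn

def mean_neighbors(x, edge_index):
    """Mean of neighbor features for every node (simple message passing)."""
    src, dst = edge_index
    agg = torch.zeros_like(x).index_add_(0, dst, x[src])
    deg = torch.zeros(x.size(0), 1).index_add_(0, dst, torch.ones(dst.size(0), 1))
    return agg / deg.clamp(min=1.0)

x = torch.randn(100, 16)                       # node features
edge_index = torch.randint(0, 100, (2, 400))   # random edges for illustration

h = x
for depth in range(3):                         # one GNN layer at a time
    layer = nn.Linear(h.size(1), 16)
    decoder = nn.Linear(16, h.size(1))         # throwaway head for the local loss
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=1e-2)
    for _ in range(50):
        z = torch.relu(layer(mean_neighbors(h, edge_index)))
        # Stand-in local objective: reconstruct the (frozen) layer input.
        loss = nn.functional.mse_loss(decoder(z), h)
        opt.zero_grad(); loss.backward(); opt.step()
    # Freeze this layer's output as the next layer's input.
    h = torch.relu(layer(mean_neighbors(h, edge_index))).detach()
```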
- Training Graph Neural Networks on Growing Stochastic Graphs
Graph Neural Networks (GNNs) rely on graph convolutions to exploit meaningful patterns in networked data.
We propose to learn GNNs on very large graphs by leveraging the limit object of a sequence of growing graphs, the graphon.
arXiv Detail & Related papers (2022-10-27T16:00:45Z)
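The graphon viewpoint is easy to make concrete: a graphon is a symmetric function W: [0,1]² → [0,1], and a random graph of any size n is sampled from it by drawing latent positions u_i and connecting nodes with probability W(u_i, u_j). Training on a sequence of growing samples from the same W is the paper's core recipe. A sketch under my own assumptions (a toy smooth W, NumPy only):

```python
import numpy as np

def sample_from_graphon(n, W, rng):
    """Sample an n-node graph: latent u_i ~ U[0,1], edge (i,j) w.p. W(u_i, u_j)."""
    u = rng.random(n)
    probs = W(u[:, None], u[None, :])
    adj = (rng.random((n, n)) < probs).astype(float)
    adj = np.triu(adj, 1)                 # undirected, no self-loops
    return adj + adj.T

rng = np.random.default_rng(0)
W = lambda a, b: 0.8 * np.exp(-3.0 * np.abs(a - b))   # toy smooth graphon

# Train on a sequence of growing graphs drawn from the same limit object;
# the learned GNN then transfers to much larger graphs from that limit.
for n in (64, 128, 256, 512):
    adj = sample_from_graphon(n, W, rng)
    # ... run one or more GNN training epochs on this graph ...
```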
- A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking
Large-scale graph training is a notoriously challenging problem for graph neural networks (GNNs).
We present a new ensembling training manner, named EnGCN, to address the existing issues.
Our proposed method has achieved new state-of-the-art (SOTA) performance on large-scale datasets.
arXiv Detail & Related papers (2022-10-14T03:43:05Z)
- Scalable training of graph convolutional neural networks for fast and accurate predictions of HOMO-LUMO gap in molecules
This work focuses on building GCNN models on HPC systems to predict material properties of millions of molecules.
We use HydraGNN, our in-house library for large-scale GCNN training, leveraging distributed data parallelism in PyTorch.
We perform parallel training on two open-source large-scale graph datasets to build a GCNN predictor for an important quantum property known as the HOMO-LUMO gap.
arXiv Detail & Related papers (2022-07-22T20:54:22Z)
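The distributed data parallelism mentioned here follows PyTorch's standard recipe: each rank holds a model replica, a DistributedSampler shards the dataset, and gradients are all-reduced automatically during the backward pass. A generic sketch (not HydraGNN's code; the dataset and model below are placeholders), launched with e.g. `torchrun --nproc_per_node=8 train.py`:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")                  # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))  # placeholder
    sampler = DistributedSampler(dataset)            # shards data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(8, 1).cuda())        # placeholder model
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(3):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for xb, yb in loader:
            loss = torch.nn.functional.mse_loss(model(xb.cuda()), yb.cuda())
            opt.zero_grad()
            loss.backward()                          # gradients all-reduced here
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```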
- Towards Training Billion Parameter Graph Neural Networks for Atomic Simulations
We introduce Graph Parallelism, a method to distribute input graphs across multiple GPUs, enabling us to train very large GNNs with hundreds of millions or billions of parameters.
On the large-scale Open Catalyst 2020 dataset, these graph-parallelized models lead to relative improvements of 1) 15% on the force MAE metric for the S2EF task and 2) 21% on the AFbT metric for the IS2RS task.
arXiv Detail & Related papers (2022-03-18T01:54:34Z)
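Graph Parallelism differs from the data parallelism above: a single large input graph is split across GPUs, each rank aggregates messages over its shard of the edges, and an all-reduce combines the partial sums. The rough sketch below is my own simplification (sum aggregation, node features replicated on every rank, `torch.distributed` already initialized), not the paper's implementation:

```python
import torch
import torch.distributed as dist

def sharded_aggregate(x, edge_shard):
    """Each rank owns a disjoint shard of the edge list; node features are
    replicated. Partial neighbor sums are combined with one all-reduce."""
    src, dst = edge_shard
    partial = torch.zeros_like(x).index_add_(0, dst, x[src])
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # sum partials across ranks
    return partial

# Usage (inside an initialized process group): split the edges evenly,
# aggregate, then apply the per-node update locally on every rank.
# shards = torch.chunk(edge_index, dist.get_world_size(), dim=1)
# h = sharded_aggregate(h, shards[dist.get_rank()])
```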
- Neighbor2Seq: Deep Learning on Massive Graphs by Transforming Neighbors to Sequences
We propose Neighbor2Seq to transform the hierarchical neighborhood of each node into a sequence.
We evaluate our method on a massive graph with more than 111 million nodes and 1.6 billion edges.
Results show that our proposed method is scalable to massive graphs and achieves superior performance across massive and medium-scale graphs.
arXiv Detail & Related papers (2022-02-07T16:38:36Z)
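The Neighbor2Seq transform can be sketched as an offline precomputation: aggregate each node's neighborhood hop by hop and stack the results as one sequence token per hop, which a downstream sequence model (or MLP) then consumes without ever touching the graph again. A minimal sketch under assumptions of my own (mean aggregation, dense tensors):

```python
import torch

def neighbor2seq(x, edge_index, hops=3):
    """Return a (num_nodes, hops + 1, dim) tensor: token t is the mean-aggregated
    t-hop neighborhood of each node (token 0 is the node itself)."""
    src, dst = edge_index
    deg = torch.zeros(x.size(0), 1).index_add_(0, dst, torch.ones(dst.size(0), 1))
    deg = deg.clamp(min=1.0)
    tokens, h = [x], x
    for _ in range(hops):
        h = torch.zeros_like(h).index_add_(0, dst, h[src]) / deg  # one more hop
        tokens.append(h)
    return torch.stack(tokens, dim=1)

x = torch.randn(1000, 32)
edge_index = torch.randint(0, 1000, (2, 5000))
seq = neighbor2seq(x, edge_index)   # (1000, 4, 32); the graph is no longer needed
# Any sequence model can now be trained on `seq` with plain mini-batching.
```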
- TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs
We propose TC-GNN, the first GNN framework based on GPU Tensor Core Units (TCUs).
The core idea is to reconcile "sparse" GNN computation with the high-performance "dense" TCUs.
Rigorous experiments show an average 1.70× speedup over the state-of-the-art DGL framework.
arXiv Detail & Related papers (2021-12-03T18:06:23Z)
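The sparse-to-dense idea can be illustrated at a high level: group adjacency rows into windows, condense each window's scattered neighbor columns into a compact dense tile, and dispatch a dense matmul on that tile, which is exactly the shape of work TCUs accelerate. The Python below is a rough illustration of the condensing step only, not TC-GNN's CUDA kernels:

```python
import torch

def windowed_dense_aggregate(adj, feats, window=4):
    """For each window of rows, gather the union of touched neighbor columns
    into a small dense tile and aggregate with a dense matmul."""
    out = torch.zeros(adj.size(0), feats.size(1))
    for r0 in range(0, adj.size(0), window):
        rows = adj[r0:r0 + window]
        cols = rows.sum(dim=0).nonzero(as_tuple=True)[0]   # columns touched
        if cols.numel() == 0:
            continue
        tile = rows[:, cols]                      # dense (window, |cols|) tile
        out[r0:r0 + window] = tile @ feats[cols]  # dense matmul -> TCU-friendly
    return out

adj = (torch.rand(16, 16) < 0.1).float()          # toy sparse adjacency
feats = torch.randn(16, 8)
assert torch.allclose(windowed_dense_aggregate(adj, feats), adj @ feats, atol=1e-5)
```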
- ACE-HGNN: Adaptive Curvature Exploration Hyperbolic Graph Neural Network
We propose an Adaptive Curvature Exploration Hyperbolic Graph Neural Network, named ACE-HGNN, to adaptively learn the optimal curvature according to the input graph and downstream tasks.
Experiments on multiple real-world graph datasets demonstrate a significant and consistent improvement in model quality, with competitive performance and good generalization ability.
arXiv Detail & Related papers (2021-10-15T07:18:57Z)
- Training Graph Neural Networks with 1000 Layers
We study reversible connections, group convolutions, weight tying, and equilibrium models to advance the memory and parameter efficiency of GNNs.
The resulting model, RevGNN-Deep, is to the best of our knowledge the deepest GNN in the literature by one order of magnitude.
arXiv Detail & Related papers (2021-06-14T15:03:00Z)
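Reversible connections are the key memory trick here: split the features into two groups and update them as y1 = x1 + F(x2), y2 = x2 + G(y1). Because the inputs can be recomputed exactly from the outputs (x2 = y2 − G(y1), then x1 = y1 − F(x2)), intermediate activations need not be stored, so depth no longer costs activation memory. A minimal sketch, assuming simple linear F and G rather than RevGNN's graph convolutions:

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.G = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Exact recomputation of the inputs: no activations need be cached.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = ReversibleBlock(16)
x1, x2 = torch.randn(10, 16), torch.randn(10, 16)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
assert torch.allclose(r1, x1, atol=1e-6) and torch.allclose(r2, x2, atol=1e-6)
```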
- A Unified Lottery Ticket Hypothesis for Graph Neural Networks
We present a unified GNN sparsification (UGS) framework that simultaneously prunes the graph adjacency matrix and the model weights.
We further generalize the popular lottery ticket hypothesis to GNNs for the first time, by defining a graph lottery ticket (GLT) as a pair consisting of a core sub-dataset and a sparse sub-network.
arXiv Detail & Related papers (2021-02-12T21:52:43Z)
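UGS's joint sparsification can be sketched with two magnitude masks, one over the adjacency and one over the weights, pruned by the same criterion. The sketch below is a loose illustration assuming a dense adjacency and one-shot magnitude pruning, rather than UGS's trained differentiable masks:

```python
import torch

def magnitude_prune(tensor, frac):
    """Zero out the smallest `frac` of entries by absolute value."""
    k = int(tensor.numel() * frac)
    if k == 0:
        return torch.ones_like(tensor)
    thresh = tensor.abs().flatten().kthvalue(k).values
    return (tensor.abs() > thresh).float()

adj = torch.rand(100, 100)                # toy weighted adjacency
weight = torch.randn(32, 16)              # one GNN weight matrix

adj_mask = magnitude_prune(adj, 0.2)      # prune 20% of edges...
w_mask = magnitude_prune(weight, 0.4)     # ...and 40% of weights
sparse_adj, sparse_w = adj * adj_mask, weight * w_mask
# A "graph lottery ticket" is the pair (sparse_adj, sparse_w) that, retrained
# from the original initialization, matches the dense model's accuracy.
```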
- SIGN: Scalable Inception Graph Neural Networks
We propose a new, efficient and scalable graph deep learning architecture that sidesteps the need for graph sampling.
Our architecture allows using different local graph operators to best suit the task at hand.
We obtain state-of-the-art results on ogbn-papers100M, the largest public graph dataset, with over 110 million nodes and 1.5 billion edges.
arXiv Detail & Related papers (2020-04-23T14:46:10Z)
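SIGN's sampling-free recipe is precomputation: apply several fixed graph operators (e.g. powers of the normalized adjacency) to the node features once, concatenate the results, and train an ordinary MLP on the concatenation with plain mini-batches. A minimal sketch, assuming the symmetric-normalized adjacency as the only operator family:

```python
import torch

def sign_features(adj, x, r=3):
    """Precompute [X, AX, A^2 X, ..., A^r X] with A = D^-1/2 (adj) D^-1/2."""
    deg = adj.sum(dim=1).clamp(min=1.0)
    d_inv_sqrt = deg.pow(-0.5)
    a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    feats, h = [x], x
    for _ in range(r):
        h = a_norm @ h
        feats.append(h)
    return torch.cat(feats, dim=1)        # (num_nodes, (r + 1) * dim)

adj = (torch.rand(200, 200) < 0.05).float()
x = torch.randn(200, 16)
z = sign_features(adj, x)                 # one-off preprocessing
mlp = torch.nn.Sequential(torch.nn.Linear(z.size(1), 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
# From here on, training is plain mini-batching over rows: no graph sampling.
pred = mlp(z[torch.randint(0, 200, (32,))])
```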