Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics
- URL: http://arxiv.org/abs/2407.19614v2
- Date: Tue, 30 Jul 2024 05:27:59 GMT
- Title: Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics
- Authors: Ryan Stocks, Elise Palethorpe, Giuseppe M. J. Barca
- Abstract summary: This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs)
The algorithm is especially designed for high throughput \emph{ab initio} molecular dynamics simulations of small and medium size molecules (10-100 atoms)
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock (RI-HF) energies and analytic gradients using multiple Graphics Processing Units (GPUs). The algorithm is especially designed for high throughput \emph{ab initio} molecular dynamics simulations of small and medium size molecules (10-100 atoms). Key innovations of this work include the exploitation of multi-GPU parallelism and a workload balancing scheme that efficiently distributes computational tasks among GPUs. Our implementation also employs techniques for symmetry utilization, integral screening and leveraging sparsity to optimize memory usage and computational efficiency. Computational results show that the implementation achieves significant performance improvements, including over $3\times$ speedups in single GPU AIMD throughput compared to previous GPU-accelerated RI-HF and traditional HF methods. Furthermore, utilizing multiple GPUs can provide super-linear speedup when the additional aggregate GPU memory allows for the storage of decompressed three-center integrals. Additionally, we report strong scaling efficiencies for systems up to 1000 basis functions and demonstrate practical applications through extensive performance benchmarks on up to quadruple-$\zeta$ primary basis sets, achieving floating-point performance of up to 47\% of the theoretical peak on a 4$\times$A100 GPU node.
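For readers skimming the abstract, the resolution-of-the-identity (density-fitting) approximation expands the four-center electron-repulsion integrals over an auxiliary basis $\{P\}$; the notation below is the standard textbook form, not necessarily the paper's own:

```latex
% RI / density-fitting factorization of the four-center integrals
(\mu\nu|\lambda\sigma) \;\approx\; \sum_{P,Q} (\mu\nu|P)\,[\mathbf{V}^{-1}]_{PQ}\,(Q|\lambda\sigma),
\qquad V_{PQ} = (P|Q),
\qquad B^{P}_{\mu\nu} = \sum_{Q} (\mu\nu|Q)\,[\mathbf{V}^{-1/2}]_{QP}
```

A minimal single-device sketch of the resulting Coulomb (J) and exchange (K) builds follows, assuming a precomputed fitted tensor `B[P, mu, nu]`; the tensor name and the einsum formulation are illustrative assumptions, not the authors' implementation. In a multi-GPU setting the auxiliary index `P` is a natural axis along which to shard the tensor and balance the workload.

```python
import jax.numpy as jnp

def ri_jk(B, D):
    """B: (naux, nbf, nbf) fitted three-center tensor; D: (nbf, nbf)
    symmetric density matrix. Returns the RI Coulomb and exchange matrices."""
    gamma = jnp.einsum('Pls,ls->P', B, D)    # fitted "charges", O(naux*nbf^2)
    J = jnp.einsum('Pmn,P->mn', B, gamma)    # Coulomb build, O(naux*nbf^2)
    half = jnp.einsum('Pml,ls->Pms', B, D)   # half-transformed exchange
    K = jnp.einsum('Pms,Pns->mn', half, B)   # exchange build, O(naux*nbf^3)
    return J, K
```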
Related papers
- Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters [0.0]
opt-UM and opt-Brc introduce significant enhancements to Hartree-Fock calculations, supporting up to $f$-type angular momentum functions.
Opt-Brc excels for smaller systems and for highly contracted triple-$\zeta$ basis sets, while opt-UM is advantageous for large molecular systems.
arXiv Detail & Related papers (2024-07-31T08:49:06Z) - Optimized thread-block arrangement in a GPU implementation of a linear solver for atmospheric chemistry mechanisms [0.0]
Earth system models (ESMs) demand significant hardware resources and energy to solve atmospheric chemistry processes.
Recent studies have shown improved performance from running these models on GPU accelerators.
This study proposes an optimized distribution of the chemical solver's computational load on the GPU, named Block-cells.
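The paper's CUDA kernels are not reproduced here, but the core idea of treating each grid cell's chemistry as an independent small solve that can be batched across GPU thread blocks can be sketched as follows; the shapes and the vmap-based batching are illustrative assumptions, not the Block-cells implementation:

```python
import jax
import jax.numpy as jnp

def solve_cell(A, b):
    # One cell's small linear system, e.g. the Jacobian solve inside an
    # implicit integration step of the chemical mechanism.
    return jnp.linalg.solve(A, b)

# Batching over cells: on a GPU, each independent solve maps naturally
# onto its own group of thread blocks.
solve_all_cells = jax.jit(jax.vmap(solve_cell))

ncells, nspecies = 4096, 32
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
A = (jax.random.normal(k1, (ncells, nspecies, nspecies))
     + 10.0 * jnp.eye(nspecies))             # keep systems well-conditioned
b = jax.random.normal(k2, (ncells, nspecies))
x = solve_all_cells(A, b)                    # (ncells, nspecies)
```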
arXiv Detail & Related papers (2024-05-27T17:12:59Z) - Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs [3.7101665559244874]
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) for the Intel Data Center GPU Max 1550.
We show with a simple model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference.
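The arithmetic-intensity argument can be made concrete with a back-of-the-envelope model (our simplification, not the paper's exact accounting): fusing leaves the FLOP count unchanged while removing the global-memory round trips of intermediate activations.

```python
# Toy model of arithmetic intensity (FLOPs per byte of global-memory
# traffic) for a batch of inputs through `layers` square GEMM layers.
# bytes_per_el=2 assumes half precision; all numbers are illustrative.
def arithmetic_intensity(batch, width, layers, fused, bytes_per_el=2):
    flops = 2 * batch * width * width * layers            # GEMM FLOPs
    weight_bytes = width * width * layers * bytes_per_el  # weights read once
    io_bytes = 2 * batch * width * bytes_per_el           # input + output
    if not fused:
        # Every intermediate activation is written to and re-read from
        # global memory between layers.
        io_bytes += 2 * (layers - 1) * batch * width * bytes_per_el
    return flops / (weight_bytes + io_bytes)

print(arithmetic_intensity(2**16, 64, 4, fused=False))  # lower intensity
print(arithmetic_intensity(2**16, 64, 4, fused=True))   # higher intensity
```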
arXiv Detail & Related papers (2024-03-26T11:38:39Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
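As a point of reference for what "the computation graph of an nth-order gradient" means, here is a minimal software sketch in JAX; this is an autodiff analogue of the idea, not the paper's FPGA dataflow toolchain:

```python
import jax
import jax.numpy as jnp

def nth_grad(f, n):
    # Each application of grad rewrites the computation graph; the
    # resulting graph is what a compiler can then map onto hardware.
    for _ in range(n):
        f = jax.grad(f)
    return f

f = lambda x: jnp.sin(x) * x**2
d3f = jax.jit(nth_grad(f, 3))   # compiled third-order derivative
print(d3f(1.0))
```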
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - ParaGraph: Weighted Graph Representation for Performance Optimization of
HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with normalized RMSE ranging from 0.004 to at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z) - EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense
Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
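The key trick behind linear attention in general is associativity: computing $Q(K^{\top}V)$ instead of $(QK^{\top})V$ drops the cost from quadratic to linear in the number of tokens. The sketch below uses a ReLU feature map under our own simplifications; EfficientViT's full module additionally aggregates across scales.

```python
import jax.numpy as jnp

def relu_linear_attention(q, k, v, eps=1e-6):
    """q, k: (N, d); v: (N, dv). Costs O(N*d*dv) rather than O(N^2*d)."""
    q, k = jnp.maximum(q, 0.0), jnp.maximum(k, 0.0)  # ReLU feature map
    kv = k.T @ v              # (d, dv): aggregate over tokens, computed once
    z = k.sum(axis=0)         # (d,): normalization aggregate
    return (q @ kv) / ((q @ z)[:, None] + eps)
```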
arXiv Detail & Related papers (2022-05-29T20:07:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the resulting summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
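For intuition, exemplar-based summarization of this kind can be phrased as greedily selecting $k$ exemplars that minimize the total distance from every point to its nearest chosen exemplar. Below is a generic, accelerator-friendly formulation of that greedy objective, not the paper's GPU kernels:

```python
import jax.numpy as jnp

def greedy_exemplars(X, k):
    """X: (n, d) data. Returns indices of k greedily chosen exemplars."""
    # Pairwise squared distances, computed once as a batched array op.
    d2 = jnp.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # (n, n)
    best = jnp.full(X.shape[0], jnp.inf)  # distance to nearest exemplar
    chosen = []
    for _ in range(k):
        # Total cost if candidate j were added next.
        cost = jnp.minimum(best[:, None], d2).sum(axis=0)        # (n,)
        j = int(jnp.argmin(cost))
        chosen.append(j)
        best = jnp.minimum(best, d2[:, j])
    return chosen
```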
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose the Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z) - Large Batch Simulation for Deep Reinforcement Learning [101.01408262583378]
We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work.
We realize end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine.
By combining batch simulation and performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system.
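"Batch simulation" here means advancing thousands of environments in lockstep so that simulation, inference, and learning each run as one large batched GPU operation. A toy sketch of the stepping pattern, with illustrative dynamics rather than the paper's 3D simulator:

```python
import jax
import jax.numpy as jnp

def step(state, action):
    return state + 0.1 * action     # toy environment dynamics

batched_step = jax.jit(jax.vmap(step))  # one call advances every env

states = jnp.zeros((8192, 4))       # 8192 environments, 4-dim state each
actions = jnp.ones((8192, 4))
states = batched_step(states, actions)
```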
arXiv Detail & Related papers (2021-03-12T00:22:50Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
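One of the algorithmic ideas alluded to is Nyström-style random projection: approximate the $n \times n$ kernel matrix through $m \ll n$ inducing points, so the solver only ever touches $(n \times m)$ and $(m \times m)$ blocks. A generic kernel ridge regression sketch under that assumption (not the specific solver developed in the paper):

```python
import jax.numpy as jnp

def gaussian_kernel(A, B, sigma=1.0):
    d2 = (jnp.sum(A**2, 1)[:, None] - 2.0 * A @ B.T
          + jnp.sum(B**2, 1)[None, :])
    return jnp.exp(-d2 / (2.0 * sigma**2))

def nystrom_krr(X, y, centers, lam=1e-3):
    """Fit kernel ridge regression using m inducing points ('centers')."""
    n, m = X.shape[0], centers.shape[0]
    Knm = gaussian_kernel(X, centers)        # (n, m): never form n x n
    Kmm = gaussian_kernel(centers, centers)  # (m, m)
    A = Knm.T @ Knm + lam * n * Kmm          # reduced normal equations
    alpha = jnp.linalg.solve(A + 1e-8 * jnp.eye(m), Knm.T @ y)
    return alpha  # predict with: gaussian_kernel(X_new, centers) @ alpha
```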
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.