Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics
- URL: http://arxiv.org/abs/2407.19614v2
- Date: Tue, 30 Jul 2024 05:27:59 GMT
- Title: Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics
- Authors: Ryan Stocks, Elise Palethorpe, Giuseppe M. J. Barca
- Abstract summary: This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is especially designed for high throughput \emph{ab initio} molecular dynamics simulations of small and medium size molecules (10-100 atoms).
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock (RI-HF) energies and analytic gradients using multiple Graphics Processing Units (GPUs). The algorithm is especially designed for high throughput \emph{ab initio} molecular dynamics simulations of small and medium size molecules (10-100 atoms). Key innovations of this work include the exploitation of multi-GPU parallelism and a workload balancing scheme that efficiently distributes computational tasks among GPUs. Our implementation also employs techniques for symmetry utilization, integral screening and leveraging sparsity to optimize memory usage and computational efficiency. Computational results show that the implementation achieves significant performance improvements, including over $3\times$ speedups in single GPU AIMD throughput compared to previous GPU-accelerated RI-HF and traditional HF methods. Furthermore, utilizing multiple GPUs can provide super-linear speedup when the additional aggregate GPU memory allows for the storage of decompressed three-center integrals. Additionally, we report strong scaling efficiencies for systems up to 1000 basis functions and demonstrate practical applications through extensive performance benchmarks on up to quadruple-$\zeta$ primary basis sets, achieving floating-point performance of up to 47\% of the theoretical peak on a 4$\times$A100 GPU node.
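The RI machinery the abstract refers to can be illustrated with a short sketch. This is not the paper's implementation; it is a minimal NumPy rendition of the standard RI Coulomb (J) build, with random placeholder tensors standing in for the three-center integrals $(P|\mu\nu)$, the Coulomb metric $(P|Q)$, and the density matrix:

```python
import numpy as np

# Toy dimensions (placeholders): nbf AO basis functions, naux auxiliary functions.
nbf, naux = 6, 20
rng = np.random.default_rng(0)

# Placeholder three-center integrals B[P, mu, nu] = (P|mu nu),
# symmetrized in (mu, nu) like the real integrals.
B = rng.standard_normal((naux, nbf, nbf))
B = 0.5 * (B + B.transpose(0, 2, 1))

# Placeholder Coulomb metric (P|Q): symmetric positive definite by construction.
M = rng.standard_normal((naux, naux))
V = M @ M.T + naux * np.eye(naux)

# Placeholder symmetric density matrix.
D = rng.standard_normal((nbf, nbf))
D = 0.5 * (D + D.T)

# RI Coulomb build: J_{mu nu} = (mu nu|P) [V^{-1}]_{PQ} (Q|lam sig) D_{lam sig}
gamma = np.einsum('qls,ls->q', B, D)   # contract density into the auxiliary index
c = np.linalg.solve(V, gamma)          # apply the inverse metric without forming V^{-1}
J = np.einsum('pmn,p->mn', B, c)       # expand back to the AO pair index

# J inherits the mu <-> nu permutational symmetry that the paper exploits.
assert np.allclose(J, J.T)
```

Solving against the metric rather than explicitly inverting it mirrors common density-fitting practice; on GPUs the two `einsum` contractions become the batched GEMMs that dominate RI-HF cost.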
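The abstract's multi-GPU workload-balancing scheme is not spelled out here. As a generic sketch of the underlying idea only, a greedy least-loaded ("longest processing time first") assignment of estimated task costs to GPUs can be written as follows; the function name and cost model are illustrative assumptions, not the paper's algorithm:

```python
import heapq

def balance_tasks(costs, n_gpus):
    """Greedy LPT balancing (illustrative, not the paper's scheme):
    visit tasks in order of decreasing estimated cost and always
    hand the next task to the currently least-loaded GPU."""
    # Heap entries are (load, gpu_id, task_ids); gpu_id breaks ties
    # so the task lists are never compared.
    heap = [(0.0, gpu, []) for gpu in range(n_gpus)]
    heapq.heapify(heap)
    for tid in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, gpu, assigned = heapq.heappop(heap)
        assigned.append(tid)
        heapq.heappush(heap, (load + costs[tid], gpu, assigned))
    return {gpu: assigned for _, gpu, assigned in heap}

# Hypothetical per-batch cost estimates (e.g. shell-pair counts).
plan = balance_tasks([9.0, 7.0, 6.0, 5.0, 4.0, 3.0], n_gpus=2)
```

LPT is a classic heuristic with a 4/3 makespan guarantee; a real scheme would presumably also weigh per-GPU memory, given the abstract's note that storing decompressed three-center integrals enables super-linear speedup.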
Related papers
- GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions [54.570944939061555]
We present a comprehensive study of GPU-accelerated graph-based vector search algorithms. We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units. Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems.
arXiv Detail & Related papers (2026-02-10T16:18:04Z) - FlipFlop: A Static Analysis-based Energy Optimization Framework for GPU Kernels [38.75222180281849]
FlipFlop is a framework using static code analysis to predict energy consumption and recommend optimal thread block configurations. It achieves 83% accuracy in identifying optimal energy-efficient configurations, while also minimizing developer effort by reducing the optimization search space by 93.4%. For multi-head attention kernels, it yields up to 79% energy savings and 106% throughput gains relative to NVIDIA's occupancy recommendations.
arXiv Detail & Related papers (2026-01-19T19:30:25Z) - GaDE -- GPU-acceleration of time-dependent Dirac Equation for exascale [0.0]
GaDE is designed to simulate the electron dynamics in atoms induced by electromagnetic fields in the relativistic regime. We evaluate GaDE on the pre-exascale supercomputer LUMI, powered by AMD MI250X GPUs and Hewlett-Packard's Slingshot interconnect.
arXiv Detail & Related papers (2025-12-25T14:47:36Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Investigating Matrix Repartitioning to Address the Over- and Undersubscription Challenge for a GPU-based CFD Solver [0.688204255655161]
Existing approaches either run fully on the GPU or use plugin-based GPU solvers, each facing trade-offs between performance and development effort. We propose a repartitioning strategy that better balances CPU matrix assembly and GPU-based linear solves. Our results show that the proposed method significantly mitigates oversubscription issues, improving solver performance and resource utilization.
arXiv Detail & Related papers (2025-10-09T17:53:12Z) - Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction [76.62155593340763]
Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales. However, the graph representations required for this task tend to be densely connected. We present a distributed eGNN implementation which leverages direct GPU communication and introduce a partitioning strategy for the input graph.
arXiv Detail & Related papers (2025-07-04T23:53:47Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters [0.0]
opt-UM and opt-Brc introduce significant enhancements to Hartree-Fock calculations up to $f$-type angular momentum functions.
Opt-Brc excels for smaller systems and for highly contracted triple-$\zeta$ basis sets, while opt-UM is advantageous for large molecular systems.
arXiv Detail & Related papers (2024-07-31T08:49:06Z) - Optimized thread-block arrangement in a GPU implementation of a linear solver for atmospheric chemistry mechanisms [0.0]
Earth system models (ESM) demand significant hardware resources and energy consumption to solve atmospheric chemistry processes.
Recent studies have shown improved performance from running these models on GPU accelerators.
This study proposes an optimized distribution of the chemical solver's computational load on the GPU, named Block-cells.
arXiv Detail & Related papers (2024-05-27T17:12:59Z) - Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs [3.7101665559244874]
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) for the Intel Data Center GPU Max 1550.
We show with a simple model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference.
arXiv Detail & Related papers (2024-03-26T11:38:39Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is indeed effective and has normalized RMSE as low as 0.004 to at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z) - EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves the global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose a Composable On-Package GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.