Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters
- URL: http://arxiv.org/abs/2407.21445v1
- Date: Wed, 31 Jul 2024 08:49:06 GMT
- Title: Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters
- Authors: Elise Palethorpe, Ryan Stocks, Giuseppe M. J. Barca
- Abstract summary: opt-UM and opt-Brc introduce significant enhancements to Hartree-Fock calculations, supporting basis functions up to $f$-type angular momentum.
Opt-Brc excels for smaller systems and for highly contracted triple-$\zeta$ basis sets, while opt-UM is advantageous for large molecular systems.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This Article presents two optimized multi-GPU algorithms for Fock matrix construction, building on the work of Ufimtsev et al. and Barca et al. The novel algorithms, opt-UM and opt-Brc, introduce significant enhancements, including improved integral screening, exploitation of sparsity and symmetry, a linear-scaling exchange matrix assembly algorithm, and extended capabilities for Hartree-Fock calculations up to $f$-type angular momentum functions. Opt-Brc excels for smaller systems and for highly contracted triple-$\zeta$ basis sets, while opt-UM is advantageous for large molecular systems. Performance benchmarks on NVIDIA A100 GPUs show that our algorithms in the EXtreme-scale Electronic Structure System (EXESS), when combined, outperform all current GPU and CPU Fock build implementations in TeraChem, QUICK, GPU4PySCF, LibIntX, ORCA, and Q-Chem. The implementations were benchmarked on linear and globular systems and average speedups across three double-$\zeta$ basis sets of 1.5$\times$, 5.2$\times$, and 8.5$\times$ were observed compared to TeraChem, GPU4PySCF, and QUICK respectively. Strong scaling analysis reveals over 91% parallel efficiency on four GPUs for opt-Brc, making it typically faster for multi-GPU execution. Single-compute-node comparisons with CPU-based software like ORCA and Q-Chem show speedups of up to 42$\times$ and 31$\times$, respectively, enhancing power efficiency by up to 18$\times$.
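The exact kernels of opt-UM and opt-Brc are not reproduced here, but the integral screening the abstract refers to can be illustrated with the standard Cauchy-Schwarz bound $|(ab|cd)| \le \sqrt{(ab|ab)}\sqrt{(cd|cd)}$, which lets a Fock build provably skip negligible shell quartets. A minimal NumPy sketch (Coulomb term only; `eri_fn`, the shell layout, and the density-weighted threshold are illustrative assumptions, not this paper's API):

```python
import numpy as np

def fock_coulomb_screened(D, shells, eri_fn, tau=1e-10):
    """Toy Coulomb build over shell quartets with Cauchy-Schwarz screening.

    D       : (nbf, nbf) density matrix
    shells  : list of (start, size) basis-function ranges, one per shell
    eri_fn  : hypothetical callable returning the (ab|cd) integral block
              for shells a, b, c, d as an (na, nb, nc, nd) array
    tau     : screening threshold
    """
    nbf = D.shape[0]
    J = np.zeros((nbf, nbf))
    ns = len(shells)
    # Schwarz factors: Q[a,b] = sqrt(max|(ab|ab)|), so |(ab|cd)| <= Q[a,b] * Q[c,d].
    Q = np.empty((ns, ns))
    for a in range(ns):
        for b in range(ns):
            Q[a, b] = np.sqrt(np.max(np.abs(eri_fn(a, b, a, b))))
    Dmax = np.max(np.abs(D))
    for a in range(ns):
        for b in range(ns):
            for c in range(ns):
                for d in range(ns):
                    # Skip quartets whose contribution is provably below tau.
                    if Q[a, b] * Q[c, d] * Dmax < tau:
                        continue
                    a0, na = shells[a]; b0, nb = shells[b]
                    c0, nc = shells[c]; d0, nd = shells[d]
                    block = eri_fn(a, b, c, d)   # (na, nb, nc, nd)
                    # J_ab += (ab|cd) D_cd; the exchange term is built analogously.
                    J[a0:a0+na, b0:b0+nb] += np.einsum(
                        'abcd,cd->ab', block, D[c0:c0+nc, d0:d0+nd])
    return J
```

Production codes additionally fold in the eight-fold permutational symmetry of the integrals and per-shell-pair density-weighted bounds, which is where much of the sparsity exploitation mentioned above comes from.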
Related papers
- Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is especially designed for high-throughput ab initio molecular dynamics simulations of small and medium-size molecules (10-100 atoms).
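For orientation, the resolution-of-the-identity (RI) approximation factorizes the four-center integrals through an auxiliary basis, $(\mu\nu|\lambda\sigma) \approx \sum_P B^P_{\mu\nu} B^P_{\lambda\sigma}$. A dense NumPy sketch of the resulting J/K build (array names and the eigendecomposition route to the inverse-square-root metric are illustrative choices, not this paper's GPU implementation):

```python
import numpy as np

def ri_jk(eri3c, metric, D):
    """Coulomb (J) and exchange (K) matrices from RI-factorized integrals.

    eri3c  : (naux, nbf, nbf) three-center integrals (P|mu nu)
    metric : (naux, naux) two-center metric (P|Q), assumed well-conditioned
    D      : (nbf, nbf) density matrix
    """
    # B[P,mu,nu] = sum_Q J^{-1/2}[P,Q] (Q|mu nu), so that
    # (mu nu|la si) ~= sum_P B[P,mu,nu] B[P,la,si]
    w, V = np.linalg.eigh(metric)
    Jinv_half = V @ np.diag(w**-0.5) @ V.T
    B = np.einsum('PQ,Qmn->Pmn', Jinv_half, eri3c)

    gamma = np.einsum('Pmn,mn->P', B, D)                     # fitted density
    J = np.einsum('Pmn,P->mn', B, gamma)                     # Coulomb
    K = np.einsum('Pml,Pns,ls->mn', B, B, D, optimize=True)  # exchange
    return J, K
```

The exchange contraction dominates the cost, which is why a multi-GPU implementation batches it over the auxiliary index; the sketch above keeps everything in host memory for clarity.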
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
- Introducing GPU-acceleration into the Python-based Simulations of Chemistry Framework
We introduce the first version of GPU4PySCF, a module that provides GPU acceleration of methods in PySCF.
Benchmark calculations show a significant speedup of two orders of magnitude with respect to the multi-threaded CPU Hartree-Fock code of PySCF.
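GPU4PySCF is designed to mirror the PySCF interface, so an existing CPU workflow carries over. The snippet below shows standard PySCF usage, which is documented API; the GPU hand-off via `to_gpu()` is our understanding of the entry point and is left commented as an assumption to verify against the GPU4PySCF documentation:

```python
from pyscf import gto, scf

# Plain CPU Hartree-Fock in PySCF (standard, documented API).
mol = gto.M(atom='O 0 0 0; H 0 0 0.96; H 0 0.96 0', basis='def2-svp')
mf = scf.RHF(mol)
e_cpu = mf.kernel()

# GPU4PySCF reportedly exposes the same mean-field interface; in recent
# releases the object can be moved to the GPU (assumption -- check the
# GPU4PySCF README for the exact entry point before relying on this):
# mf_gpu = mf.to_gpu()
# e_gpu = mf_gpu.kernel()
```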
arXiv Detail & Related papers (2024-07-12T21:50:19Z)
- GPU-accelerated Auxiliary-field quantum Monte Carlo with multi-Slater determinant trial states
We present an implementation and application of graphics processing unit-accelerated ph-AFQMC.
Using multi-Slater trial states, ph-AFQMC has the potential to faithfully treat strongly correlated systems.
Our work significantly enhances the efficiency of MSD-AFQMC calculations for large, strongly correlated molecules.
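The "multi-Slater trial states" can be made concrete: the trial wavefunction is a CI-style expansion $|\Psi_T\rangle = \sum_i c_i |D_i\rangle$, and a basic AFQMC primitive is its overlap with a walker determinant. A minimal NumPy sketch (shapes and names are illustrative; a GPU code batches these determinants and reuses intermediates rather than looping):

```python
import numpy as np

def msd_overlap(coeffs, trial_dets, walker):
    """Overlap of a multi-Slater-determinant trial state with a walker.

    coeffs     : (ndet,) CI coefficients c_i
    trial_dets : (ndet, nbf, nocc) occupied-orbital matrices of each determinant
    walker     : (nbf, nocc) walker Slater determinant

    <Psi_T|phi> = sum_i c_i det(D_i^dagger phi)
    """
    ovlps = np.array([np.linalg.det(d.conj().T @ walker) for d in trial_dets])
    return np.dot(coeffs, ovlps)
```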
arXiv Detail & Related papers (2024-06-12T15:15:17Z)
- A distributed multi-GPU ab initio density matrix renormalization group algorithm with applications to the P-cluster of nitrogenase
We present the first distributed multi-GPU (Graphics Processing Unit) ab initio density matrix renormalization group (DMRG) algorithm.
We are able to reach an unprecedentedly large bond dimension $D=14000$ on 48 GPUs.
This is nearly three times larger than the bond dimensions reported in previous DMRG calculations for the same system using only CPUs.
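To see why $D=14000$ forces distribution across GPUs, consider the size of a single dense MPS site tensor at that bond dimension. The local dimension $d=4$ (one spatial orbital: empty, up, down, doubly occupied) and complex double precision are assumptions made only for this estimate:

```python
# One MPS site tensor A[D, d, D]: D * d * D entries of 16 bytes each.
D, d, bytes_per = 14000, 4, 16
gib = D * d * D * bytes_per / 2**30
print(f"{gib:.1f} GiB per site tensor")  # ~11.7 GiB, a large slice of one GPU
```

A single tensor already consumes most of an 80 GB accelerator once workspace for contractions is included, so the tensors (and the contraction work) must be sharded across many GPUs.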
arXiv Detail & Related papers (2023-11-06T04:01:26Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
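The two-step decomposition can be illustrated schematically: a small, architecture-tunable compute primitive, and separately declared loops around it. The sketch below uses NumPy as a stand-in; real TPPs are JIT-compiled microkernels (e.g., via LIBXSMM), not Python functions, and the function names here are hypothetical:

```python
import numpy as np

# Step 1: the computational core as a small primitive (stand-in for a TPP).
def tpp_gemm(a_block, b_block, c_block):
    c_block += a_block @ b_block  # BRGEMM-style accumulate into a view of C

# Step 2: the logical loops around the primitive, declared separately so they
# can be reordered, blocked, or parallelized without touching the core.
def blocked_matmul(A, B, C, bm=64, bn=64, bk=64):
    M, K = A.shape
    _, N = B.shape
    for i in range(0, M, bm):          # this loop nest is the "declarative" layer
        for j in range(0, N, bn):
            for k in range(0, K, bk):
                tpp_gemm(A[i:i+bm, k:k+bk], B[k:k+bk, j:j+bn], C[i:i+bm, j:j+bn])
```

The point of the separation is that the same primitive serves many operators, while the loop layer absorbs all platform-specific tiling and threading decisions.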
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Matching Pursuit Based Scheduling for Over-the-Air Federated Learning
This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning via the method of matching pursuit.
Compared to the state-of-the-art scheme, the proposed scheme has drastically lower computational complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
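As background, matching pursuit greedily selects the columns most correlated with the current residual; in the scheduling setting the columns play the role of candidate devices. A generic orthogonal-matching-pursuit sketch by way of analogy, not the paper's exact scheduling objective:

```python
import numpy as np

def omp_select(Phi, y, k):
    """Greedily pick k columns of Phi that best explain y."""
    residual, selected = y.copy(), []
    for _ in range(k):
        scores = np.abs(Phi.T @ residual)   # correlation with the residual
        scores[selected] = -np.inf          # never re-pick a chosen column
        j = int(np.argmax(scores))
        selected.append(j)
        S = Phi[:, selected]
        coef, *_ = np.linalg.lstsq(S, y, rcond=None)  # re-fit on the support
        residual = y - S @ coef
    return selected, coef
```

The greedy structure is what keeps the complexity low: each round costs one matrix-vector product and one small least-squares solve rather than a combinatorial search over device subsets.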
arXiv Detail & Related papers (2022-06-14T08:14:14Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
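Least-squares SVMs suit GPUs because training collapses to one dense linear solve over the kernel matrix. A minimal sketch of the standard LS-SVM system in its regression form (Suykens formulation, RBF kernel; PLSSVM's actual solver and API differ):

```python
import numpy as np

def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
    """Solve [0, 1^T; 1, K + I/gamma] [b; alpha] = [0; y] directly."""
    n = len(y)
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))        # dense RBF kernel matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma       # regularized kernel block
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                  # bias b, dual weights alpha
```

The (n+1) x (n+1) dense system is exactly the kind of workload GPU linear-algebra libraries accelerate well, but it also explains the scaling limits the summary alludes to: memory and solve cost grow quadratically and cubically in n.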
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Kernel methods through the roof: handling billions of points efficiently
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
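The combination of ideas the summary mentions is typified by Nystrom-restricted kernel ridge regression: only an n x m cross-kernel block and an m x m inducing-point block are ever formed, and the solve is plain dense linear algebra. A sketch under those assumptions (solvers such as Falkon additionally use a preconditioned conjugate-gradient iteration instead of the direct solve shown here):

```python
import numpy as np

def nystrom_krr(X, y, centers, lam=1e-6, sigma=1.0):
    """Kernel ridge regression restricted to m inducing points (centers)."""
    def rbf(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))
    n = len(y)
    Knm = rbf(X, centers)        # (n, m) cross-kernel block
    Kmm = rbf(centers, centers)  # (m, m) inducing-point block
    # Normal equations of the Nystrom-restricted problem:
    # (Knm^T Knm + lam * n * Kmm) alpha = Knm^T y
    alpha = np.linalg.solve(Knm.T @ Knm + lam * n * Kmm, Knm.T @ y)
    return alpha                 # predict with rbf(X_new, centers) @ alpha
```

With m fixed, memory scales as O(nm) rather than O(n^2), which is what makes billion-point problems feasible on GPU hardware.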
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than existing solutions on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)