GaDE -- GPU-acceleration of time-dependent Dirac Equation for exascale
- URL: http://arxiv.org/abs/2512.21697v1
- Date: Thu, 25 Dec 2025 14:47:36 GMT
- Title: GaDE -- GPU-acceleration of time-dependent Dirac Equation for exascale
- Authors: Johanne Elise Vembe, Marcin Krotkiewski, Magnar Bjørgve, Morten Førre, Hicham Agueny
- Abstract summary: GaDE is designed to simulate the electron dynamics in atoms induced by electromagnetic fields in the relativistic regime. We evaluate GaDE on the pre-exascale supercomputer LUMI, powered by AMD MI250X GPUs and HPE's Slingshot interconnect.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern heterogeneous high-performance computing (HPC) systems powered by advanced graphics processing unit (GPU) architectures enable accelerated computing with unprecedented performance and scalability. Here, we present a GPU-accelerated solver for the three-dimensional (3D) time-dependent Dirac equation optimized for distributed HPC systems. The solver, named GaDE, is designed to simulate the electron dynamics in atoms induced by electromagnetic fields in the relativistic regime. It combines MPI with CUDA/HIP to target both NVIDIA and AMD GPU architectures. We discuss our implementation strategies, in which most of the computations are carried out on GPUs, taking advantage of GPU-aware MPI to optimize communication performance. We evaluate GaDE on the pre-exascale supercomputer LUMI, powered by AMD MI250X GPUs and HPE's Slingshot interconnect. Single-GPU benchmarks on NVIDIA A100, GH200, and AMD MI250X show comparable compute and memory-bandwidth performance for the A100 and MI250X, with the GH200 delivering higher performance. Weak scaling on LUMI demonstrates exceptional scalability, achieving 85% parallel efficiency across 2048 GPUs, while strong scaling delivers a 16x speedup on 32 GPUs (50% efficiency) for a communication-intensive, time-dependent Dirac equation solver. These results demonstrate GaDE's high scalability, making it suitable for exascale systems and enabling predictive simulations for ultra-intense laser experiments probing relativistic quantum effects.
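The scaling figures quoted in the abstract follow the standard definitions: strong-scaling efficiency is the observed speedup divided by the ideal speedup (the GPU ratio), while weak-scaling efficiency compares the baseline runtime to the runtime at scale for a proportionally larger problem. A minimal sketch of that arithmetic (the helper names are illustrative, not from the paper):

```python
def strong_scaling_efficiency(speedup: float, gpu_ratio: float) -> float:
    """Strong scaling: fixed total problem size, more GPUs.
    Efficiency = observed speedup / ideal speedup (= GPU ratio)."""
    return speedup / gpu_ratio

def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Weak scaling: problem size grows with GPU count, so the ideal
    runtime stays flat. Efficiency = baseline runtime / runtime at scale."""
    return t_base / t_scaled

# Figures reported for GaDE on LUMI:
# a 16x speedup on 32 GPUs corresponds to 50% strong-scaling efficiency.
print(strong_scaling_efficiency(16.0, 32.0))  # 0.5
# 85% weak-scaling efficiency across 2048 GPUs means the runtime at
# 2048 GPUs is about 1/0.85 of the baseline runtime.
print(weak_scaling_efficiency(1.0, 1.0 / 0.85))
```

The 85% weak-scaling figure is the more telling one for exascale readiness, since weak scaling models the intended use: growing the simulation volume with the machine.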
Related papers
- GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions [54.570944939061555]
We present a comprehensive study of GPU-accelerated graph-based vector search algorithms. We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units. Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems.
arXiv Detail & Related papers (2026-02-10T16:18:04Z) - Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance [0.7340017786387767]
We present the introduction of MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture. We show that while improvements to GPU architecture have led to speedups of over 4.5x, advances in interconnect performance have had a larger impact, with over 16x performance improvements in time to solution.
arXiv Detail & Related papers (2025-11-18T17:04:28Z) - Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction [76.62155593340763]
Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales. However, the graph representations required for this task tend to be densely connected. We present a distributed eGNN implementation which leverages direct GPU communication and introduce a partitioning strategy for the input graph.
arXiv Detail & Related papers (2025-07-04T23:53:47Z) - Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters [0.0]
opt-UM and opt-Brc introduce significant enhancements to Hartree-Fock calculations up to $f$-type angular momentum functions.
opt-Brc excels for smaller systems and for highly contracted triple-$\zeta$ basis sets, while opt-UM is advantageous for large molecular systems.
arXiv Detail & Related papers (2024-07-31T08:49:06Z) - Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is especially designed for high-throughput ab initio molecular dynamics simulations of small and medium-size molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z) - Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs [3.7101665559244874]
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) for the Intel Data Center GPU Max 1550.
We show with a simple model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference.
arXiv Detail & Related papers (2024-03-26T11:38:39Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose a Composable On-Package GPU (COPA-GPU) architecture to provide domain-specialized GPU products.
We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.