Related papers: Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance

URL: http://arxiv.org/abs/2511.14664v1
Date: Tue, 18 Nov 2025 17:04:28 GMT
Title: Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance
Authors: W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira,
Abstract summary: We present the introduction of MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture. We show that while improvements to GPU architecture have led to speedups of over 4.5X, advances in interconnect performance have had a larger impact with over 16X performance improvements in time to solution.
Score: 0.7340017786387767
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As is intrinsic to the fundamental goal of quantum computing, classical simulation of quantum algorithms is notoriously demanding in resource requirements. Nonetheless, simulation is critical to the success of the field and a requirement for algorithm development and validation, as well as hardware design. GPU-acceleration has become standard practice for simulation, and due to the exponential scaling inherent in classical methods, multi-GPU simulation can be required to achieve representative system sizes. In this case, inter-GPU communications can bottleneck performance. In this work, we present the introduction of MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review the advances in interconnect technology and the APIs for multi-GPU communication. We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture that represents the first product to expand high-bandwidth GPU-specialized interconnects across multiple nodes. We show that while improvements to GPU architecture have led to speedups of over 4.5X across the last few generations of GPUs, advances in interconnect performance have had a larger impact with over 16X performance improvements in time to solution for multi-GPU simulations.

Related papers

GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions [54.570944939061555]
We present a comprehensive study of GPU-accelerated graph-based vector search algorithms. We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units. Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems.
arXiv Detail & Related papers (2026-02-10T16:18:04Z)
Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference. Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z)
Scaling Behaviors of Evolutionary Algorithms on GPUs: When Does Parallelism Pay Off? [43.96509049196842]
Evolutionary algorithms (EAs) are increasingly implemented on graphics processing units (GPUs) to leverage parallel processing capabilities for enhanced efficiency. We investigate how GPU parallelism alters the behavior of EAs beyond simple acceleration metrics. Our results reveal that the impact of GPU acceleration is highly heterogeneous and depends strongly on algorithmic structure.
arXiv Detail & Related papers (2026-01-26T12:55:21Z)
GaDE -- GPU-acceleration of time-dependent Dirac Equation for exascale [0.0]
GaDE is designed to simulate the electron dynamics in atoms induced by electromagnetic fields in the relativistic regime. We evaluate GaDE on the pre-exascale supercomputer LUMI, powered by AMD MI250X GPUs and Hewlett-Packard's Slingshot interconnect.
arXiv Detail & Related papers (2025-12-25T14:47:36Z)
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels [40.94392896555992]
Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical bandwidth across workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can guide the optimal performance of workloads. ParallelKittens (PK) kernels achieve up to $2.33\times$ speedup on parallel workloads.
arXiv Detail & Related papers (2025-11-17T21:48:33Z)
Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
Q-GEAR: Improving quantum simulation framework [0.28402080392117757]
We introduce Q-Gear, a software framework that transforms Qiskit quantum circuits into Cuda-Q kernels. Q-Gear accelerates CPU-based and GPU-based simulations by two orders of magnitude and ten times, respectively, with minimal coding effort.
arXiv Detail & Related papers (2025-04-04T22:17:51Z)
Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs). The algorithm is especially designed for high-throughput ab initio molecular dynamics simulations of small and medium-size molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
Hybrid quantum programming with PennyLane Lightning on HPC platforms [0.0]
PennyLane's Lightning suite is a collection of high-performance state-vector simulators targeting CPU, GPU, and HPC-native architectures and workloads. Quantum applications such as QAOA, VQE, and synthetic workloads are implemented to demonstrate the supported classical computing architectures.
arXiv Detail & Related papers (2024-03-04T22:01:03Z)
Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures. We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z)
Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
GPU Domain Specialization via Composable On-Package Architecture [0.8240720472180706]
We propose the Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth, 32x larger on-package cache, 2.3x higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs.
arXiv Detail & Related papers (2021-04-05T23:06:50Z)
The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems [45.479582612113205]
We show how to improve the performance and power efficiency of RL training on CPU-GPU systems. We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework. We also introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources.
arXiv Detail & Related papers (2020-12-08T04:50:05Z)
Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO [46.20949184826173]
This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms. In particular, non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency.
arXiv Detail & Related papers (2020-03-30T14:16:23Z)
Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high-definition videos using a hybrid GPU / CPU approach. We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. On the popular Cityscapes dataset with high-resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.