Related papers: Classical bounds on correlation-type Bell expressions and linear prepare-and-measure witnesses: efficient computation in parallel environments such as graphics processing units

URL: http://arxiv.org/abs/2503.21596v1
Date: Thu, 27 Mar 2025 15:14:32 GMT
Title: Classical bounds on correlation-type Bell expressions and linear prepare-and-measure witnesses: efficient computation in parallel environments such as graphics processing units
Authors: István Márton, Erika Bene, Péter Diviánszky, Gábor Drótos,
Abstract summary: The presented program aims at speeding up the brute force computation of the $L_d$ norm of a matrix $M$ using graphics processing units (GPUs). Alternatives for CPUs have also been implemented, and the algorithm is applicable to any parallel environment.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The presented program aims at speeding up the brute force computation of the $L_d$ norm of a matrix $M$ using graphics processing units (GPUs). Alternatives for CPUs have also been implemented, and the algorithm is applicable to any parallel environment. The $n\times m$ matrix $M$ has real elements which may represent coefficients of a bipartite correlation-type Bell expression or those of a linear prepare-and-measure (PM) witness. In this interpretation, the $L_1$ norm is the local bound of the given Bell expression, and the $L_d$ norm for $d\ge 2$ is the classical $d$-dimensional bound of the given PM witness, which is associated with the communication of $d$-level classical messages in the PM scenario. In both scenarios, the output is assumed to be binary. The code for GPUs is written in CUDA C and can utilize one NVIDIA GPU in a computer. To illustrate the performance of our implementation, we refer to Brierley et al. [arXiv:1609.05011] who needed approximately three weeks to compute the local bound on a Bell expression defined by a $42\times 42$ matrix on a standard desktop using a single CPU core. In contrast, our efficient implementation of the brute force algorithm allows us to reduce this to three minutes using a single NVIDIA RTX 6000 Ada graphics card on a desktop. For CPUs, the algorithm was implemented with OpenMP and MPI according to the shared and distributed memory models, respectively, and achieves a comparable speedup at a number of CPU cores around 100.
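
As a rough illustration of the brute-force computation the abstract describes, the sketch below evaluates the $L_1$ norm (the local bound) of a small matrix in plain Python. It is not the authors' CUDA code. It assumes the standard definition $L_1(M) = \max_{a \in \{\pm 1\}^n} \sum_j |\sum_i a_i M_{ij}|$, which follows from choosing the optimal sign $b_j$ for each column once the row signs $a$ are fixed. The function name `l1_norm_bruteforce` is hypothetical.

```python
from itertools import product


def l1_norm_bruteforce(M):
    """Brute-force L_1 norm (local bound) of a real n x m matrix M.

    Illustrative sketch only: enumerates all row sign vectors a and,
    for each one, picks the best column signs in closed form.
    """
    n = len(M)
    m = len(M[0])
    best = 0.0
    # Flipping every sign of a leaves the value unchanged, so a[0] = +1
    # can be fixed, which halves the search space to 2^(n-1) vectors.
    for tail in product((1, -1), repeat=n - 1):
        a = (1,) + tail
        total = 0.0
        for j in range(m):
            # For fixed a, the optimal b_j is the sign of this column sum,
            # so the column contributes its absolute value.
            column_sum = sum(a[i] * M[i][j] for i in range(n))
            total += abs(column_sum)
        best = max(best, total)
    return best


# CHSH coefficient matrix; its local bound is 2.
chsh = [[1, 1], [1, -1]]
print(l1_norm_bruteforce(chsh))
```

The outer loop visits $2^{n-1}$ sign vectors and its iterations are independent of one another, which is the property that makes this kind of search suitable for parallel hardware.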

Related papers

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49$\times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree. We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region. Results show that our approach is effective, with a normalized RMSE between 0.004 and 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z)
A Push-Relabel Based Additive Approximation for Optimal Transport [5.111364864495785]
Exact algorithms for computing Optimal Transport can be slow. We introduce a new and very simple approach to find an $\varepsilon$-approximation of the OT distance. Our algorithm achieves a near-optimal execution time of $O(n^2/\varepsilon^2)$ for computing OT distance.
arXiv Detail & Related papers (2022-03-07T21:40:14Z)
PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning. However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware. PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
Giga-scale Kernel Matrix Vector Multiplication on GPU [19.663081364196778]
Kernel matrix-vector multiplication (KMVM) is a foundational operation in machine learning and scientific computing. As KMVM tends to scale quadratically in both memory and time, applications are often limited by these computational constraints. We propose a novel approximation procedure coined Faster-Fast and Free Memory Method (F$^3$M) to address these scaling issues.
arXiv Detail & Related papers (2022-02-02T15:28:15Z)
Simulation of quantum physics with Tensor Processing Units: brute-force computation of ground states and time evolution [0.3232625980782302]
Tensor Processing Units (TPUs) were developed by Google exclusively to support large-scale machine learning tasks. In this paper we repurpose TPUs for the challenging problem of simulating quantum spin systems. With a TPU v3 pod, with 2048 cores, we simulate wavefunctions $|\Psi\rangle$ of up to $N=38$ qubits.
arXiv Detail & Related papers (2021-11-19T22:41:04Z)
VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator. VersaGNN achieves on average 3712$\times$ speedup with 1301.25$\times$ energy reduction on CPU, and 35.4$\times$ speedup with 17.66$\times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
Quantum algorithms for spectral sums [50.045011844765185]
We propose new quantum algorithms for estimating spectral sums of positive semi-definite (PSD) matrices. We show how the algorithms and techniques used in this work can be applied to three problems in spectral graph theory.
arXiv Detail & Related papers (2020-11-12T16:29:45Z)
Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs. A programmer specifies a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler. The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
arXiv Detail & Related papers (2020-10-29T18:49:54Z)
MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle. Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.