Hybrid programming-model strategies for GPU offloading of electronic
structure calculation kernels
- URL: http://arxiv.org/abs/2401.13772v1
- Date: Wed, 24 Jan 2024 19:38:01 GMT
- Title: Hybrid programming-model strategies for GPU offloading of electronic
structure calculation kernels
- Authors: Jean-Luc Fattebert, Christian F. A. Negre, Joshua Finkelstein,
Jamaludin Mohd-Yusof, Daniel Osei-Kuffuor, Michael E. Wall, Yu Zhang, Nicolas
Bock, Susan M. Mniszewski
- Abstract summary: PROGRESS is a library for electronic structure solvers.
It relies on the BML library, which implements linear algebra operations for electronic structure kernels.
We describe the general strategies used for these implementations on various computer architectures.
- Score: 2.4898174182192974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To address the challenge of performance portability, and facilitate the
implementation of electronic structure solvers, we developed the Basic Matrix
Library (BML) and Parallel, Rapid O(N) and Graph-based Recursive Electronic
Structure Solver (PROGRESS) libraries. BML implements linear algebra operations
necessary for electronic structure kernels using a unified user interface for
various matrix formats (dense, sparse) and architectures (CPUs, GPUs). Focusing
on Density Functional Theory (DFT) and Tight-Binding (TB) models, PROGRESS
implements several solvers for computing the single-particle density matrix and
relies on BML. In this paper, we describe the general strategies used for these
implementations on various computer architectures, using OpenMP target
functionalities on GPUs, in conjunction with third-party libraries to handle
performance critical numerical kernels. We demonstrate the portability of this
approach and its performance on benchmark problems.
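As a concrete illustration of the strategy outlined above, the sketch below shows, in plain C, how an OpenMP target data region can keep a dense matrix resident on the GPU, hand its device pointer to a third-party library (here cuBLAS) for the performance-critical GEMM, and handle a simple element-wise update directly with OpenMP target directives. The routine name and the SP2-style update used as the example kernel are illustrative assumptions; this is not the actual BML/PROGRESS interface.
```c
/* Minimal sketch of the offload strategy described in the abstract:
 * - keep the matrix device-resident across kernel calls (target data),
 * - call a vendor BLAS on the device pointers (use_device_ptr + cuBLAS),
 * - run simple element-wise kernels with OpenMP target directives.
 * Illustration only: not the actual BML/PROGRESS API. */
#include <omp.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* One branch of an SP2-like purification step, X <- 2*X - X*X,
 * for an n x n dense matrix stored column-major in X (X2 is scratch). */
void purification_step(double *X, double *X2, int n, cublasHandle_t handle)
{
    const double one = 1.0, zero = 0.0;

    #pragma omp target data map(tofrom: X[0:n*n]) map(alloc: X2[0:n*n])
    {
        /* Inside this region X and X2 refer to the device copies, so the
         * third-party library operates directly on GPU-resident data. */
        #pragma omp target data use_device_ptr(X, X2)
        {
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &one, X, n, X, n, &zero, X2, n);   /* X2 = X * X */
        }
        cudaDeviceSynchronize();   /* make sure the GEMM result is visible */

        /* Element-wise update handled by OpenMP offload itself. */
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n * n; ++i)
            X[i] = 2.0 * X[i] - X2[i];
    }
}
```
Swapping cuBLAS for rocBLAS or oneMKL, or replacing the dense GEMM with a sparse kernel, would change only the library call inside the use_device_ptr region, which is in the spirit of the portability argument made in the abstract.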
Related papers
- Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices [88.33936714942996]
We present a unifying framework that enables searching among all linear operators expressible via an Einstein summation.
We show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables.
We also introduce a Mixture-of-Experts (MoE) variant that learns an MoE in every single linear layer of the model, including the projections in the attention blocks.
arXiv Detail & Related papers (2024-10-03T00:44:50Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
Motivated by the success of convolutional networks in the image domain, we identify more efficient alternatives to dense matrices.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM [0.20971479389679337]
We generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS.
We also leverage the Apache TVM framework to derive a complete variety of processor-specific micro-kernels for GEMM.
arXiv Detail & Related papers (2023-10-31T10:36:26Z)
- Tackling the Matrix Multiplication Micro-kernel Generation with Exo [0.5517652814152908]
We present a step-by-step procedure for generating a dedicated micro-kernel for each new hardware target.
Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
arXiv Detail & Related papers (2023-10-26T14:09:57Z)
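The two micro-kernel generation papers above (with Apache TVM and with Exo) both target the small, register-blocked inner kernel around which GotoBLAS2, BLIS and OpenBLAS organize GEMM, while the packing and cache-blocking loops stay generic. The plain-C sketch below shows the shape of such a micro-kernel; MR and NR are illustrative block sizes, and the generated, processor-specific versions would use SIMD intrinsics or assembly instead.
```c
/* Sketch of a BLIS/GotoBLAS-style GEMM micro-kernel: update one MR x NR
 * block of C from packed panels of A and B.  Plain-C stand-in for the
 * architecture-specific kernels that TVM/Exo generate; MR, NR illustrative. */
#define MR 4
#define NR 4

static void micro_kernel(int kc,
                         const double *Apack, /* kc steps of MR values from a packed A panel */
                         const double *Bpack, /* kc steps of NR values from a packed B panel */
                         double *C, int ldc)  /* C block, column-major, leading dimension ldc */
{
    double acc[MR][NR] = {{0.0}};  /* accumulators: kept in registers by real kernels */

    for (int p = 0; p < kc; ++p)   /* one rank-1 update per step of the k loop */
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                acc[i][j] += Apack[p * MR + i] * Bpack[p * NR + j];

    for (int j = 0; j < NR; ++j)   /* write the block back into C */
        for (int i = 0; i < MR; ++i)
            C[i + j * ldc] += acc[i][j];
}
```
Everything outside this routine (packing, cache blocking, parallelization) can be shared across architectures, which is why generating or hand-tuning only the micro-kernel is enough to retarget the whole GEMM.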
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with a normalized RMSE between 0.004 and 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Efficient GPU implementation of randomized SVD and its applications [17.71779625877989]
Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression and deep learning algorithms.
Typical solutions for matrix decompositions have polynomial complexity, which significantly increases their computational cost and time.
We leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs) to reduce the computational burden of computing matrix decompositions.
arXiv Detail & Related papers (2021-10-05T07:42:41Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to a low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
- Towards High Performance Relativistic Electronic Structure Modelling: The EXP-T Program Package [68.8204255655161]
We present a new implementation of the FS-RCC method designed for modern parallel computers.
The performance and scaling features of the implementation are analyzed.
The software developed makes it possible to reach a completely new level of accuracy in predicting properties of atoms and molecules containing heavy and superheavy nuclei.
arXiv Detail & Related papers (2020-04-07T20:08:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences arising from its use.