An approach to performance portability through generic programming
- URL: http://arxiv.org/abs/2311.05038v1
- Date: Wed, 8 Nov 2023 21:54:43 GMT
- Title: An approach to performance portability through generic programming
- Authors: Andreas Hadjigeorgiou, Christodoulos Stylianou, Michele Weiland, Dirk Jacob Verschuur, Jacob Finkenrath
- Abstract summary: This work describes a design approach that allows the integration of low-level and verbose programming tools into high-level generic algorithms based on template meta-programming in C++.
That allows scientific software to be maintainable and efficient in a period of diversifying hardware in HPC.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The expanding hardware diversity in high performance computing adds enormous
complexity to scientific software development. Developers who aim to write
maintainable software have two options: 1) use a so-called data locality
abstraction that handles portability internally, which turns performance and
productivity into a trade-off. Such abstractions usually come in
the form of libraries, domain-specific languages, and run-time systems. 2) use
generic programming, where performance, productivity and portability are
subject to software design. Following the second option, this work describes
a design approach that allows the integration of low-level and verbose
programming tools into high-level generic algorithms based on template
meta-programming in C++. This enables the development of performance-portable
applications targeting host-device computer architectures, such as CPUs and
GPUs. With a suitable design in place, extending generic algorithms
to new hardware becomes a well-defined procedure that can be carried out in
isolation from other parts of the code. This allows scientific software to be
maintainable and efficient in a period of diversifying hardware in HPC. As a
proof of concept, a finite-difference modelling algorithm for the acoustic wave
equation is developed and benchmarked using roofline model analysis on an Intel
Xeon Gold 6248 CPU, an Nvidia Tesla V100 GPU, and an AMD MI100 GPU.
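
The design idea in the abstract, a high-level generic algorithm that delegates to low-level, backend-specific kernels, can be illustrated with a minimal C++ sketch. This is not the authors' code: the names `SerialTag`, `BlockedTag`, `detail::wave_step`, `propagate` and `max_abs_diff` are hypothetical, both backends run on the host (a real device backend would wrap e.g. a CUDA or HIP kernel launch behind the same overload), and the stencil is a plain second-order 1D scheme chosen only for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical backend tags. Supporting new hardware means adding one tag
// and one kernel overload, without touching the generic algorithm.
struct SerialTag {};
struct BlockedTag {};  // host-side stand-in for a second backend

namespace detail {

// Second-order finite-difference update for the 1D acoustic wave equation.
// c2 is the squared Courant number (c * dt / dx)^2.
inline double update(const std::vector<double>& prev,
                     const std::vector<double>& curr,
                     std::size_t i, double c2) {
  return 2.0 * curr[i] - prev[i] +
         c2 * (curr[i - 1] - 2.0 * curr[i] + curr[i + 1]);
}

// Backend 1: a plain loop over the interior points.
inline void wave_step(SerialTag, const std::vector<double>& prev,
                      const std::vector<double>& curr,
                      std::vector<double>& next, double c2) {
  for (std::size_t i = 1; i + 1 < curr.size(); ++i)
    next[i] = update(prev, curr, i, c2);
}

// Backend 2: the same update, traversed in fixed-size blocks.
inline void wave_step(BlockedTag, const std::vector<double>& prev,
                      const std::vector<double>& curr,
                      std::vector<double>& next, double c2) {
  constexpr std::size_t block = 16;
  const std::size_t end = curr.size() - 1;
  for (std::size_t b = 1; b < end; b += block) {
    const std::size_t stop = std::min(b + block, end);
    for (std::size_t i = b; i < stop; ++i)
      next[i] = update(prev, curr, i, c2);
  }
}

}  // namespace detail

// Generic algorithm: time stepping is written once, and the kernel is
// selected at compile time by overload resolution on the backend tag.
template <typename Backend>
void propagate(Backend tag, std::vector<double>& prev,
               std::vector<double>& curr, int steps, double c2) {
  assert(prev.size() == curr.size());
  assert(curr.size() >= 3);
  std::vector<double> next(curr.size(), 0.0);
  for (int s = 0; s < steps; ++s) {
    detail::wave_step(tag, prev, curr, next, c2);
    next.front() = 0.0;  // fixed (Dirichlet) boundaries
    next.back() = 0.0;
    prev.swap(curr);     // prev now holds the current field
    curr.swap(next);     // curr now holds the new field
  }
}

// Largest pointwise difference between two fields of equal size.
inline double max_abs_diff(const std::vector<double>& a,
                           const std::vector<double>& b) {
  assert(a.size() == b.size());
  double m = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    m = std::max(m, std::fabs(a[i] - b[i]));
  return m;
}
```

Calling `propagate(SerialTag{}, prev, curr, steps, c2)` and `propagate(BlockedTag{}, prev, curr, steps, c2)` on identical inputs should give matching fields, which is the property a portable design needs to preserve when a new backend is added.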
Related papers
- Specx: a C++ task-based runtime system for heterogeneous distributed architectures [0.0]
We present Specx, a task-based runtime system written in modern C++.
arXiv Detail & Related papers (2023-08-30T11:41:30Z)
- SEER: Super-Optimization Explorer for HLS using E-graph Rewriting with MLIR [0.3124884279860061]
High-level synthesis (HLS) is a process that automatically translates a software program in a high-level language into a low-level hardware description.
We propose a super-optimization approach for HLS that automatically rewrites an arbitrary software program into HLS efficient code.
We show that SEER achieves up to 38x the performance within 1.4x the area of the original program.
arXiv Detail & Related papers (2023-08-15T09:05:27Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- QParallel: Explicit Parallelism for Programming Quantum Computers [62.10004571940546]
We present a language extension for parallel quantum programming.
QParallel removes ambiguities concerning parallelism in current quantum programming languages.
We introduce a tool that guides programmers in the placement of parallel regions by identifying the subroutines that profit most from parallelization.
arXiv Detail & Related papers (2022-10-07T16:35:16Z)
- StreamBlocks: A compiler for heterogeneous dataflow computing (technical report) [1.5293427903448022]
This work introduces StreamBlocks, an open-source compiler and runtime that uses the CAL dataflow programming language to partition computations across platforms.
StreamBlocks supports exploring the design space with a profile-guided tool that helps identify the best hardware-software partitions.
arXiv Detail & Related papers (2021-07-20T08:46:47Z)
- Extending C++ for Heterogeneous Quantum-Classical Computing [56.782064931823015]
qcor is a language extension to C++ and compiler implementation that enables heterogeneous quantum-classical programming, compilation, and execution in a single-source context.
Our work provides a first-of-its-kind C++ compiler enabling high-level quantum kernel (function) expression in a quantum-language manner.
arXiv Detail & Related papers (2020-10-08T12:49:07Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
- Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction [0.0]
We propose lightweight augmented neural networks for arbitrary combinations of kernel-variant-hardware.
We are able to obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks.
Our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide's auto-scheduler.
arXiv Detail & Related papers (2020-03-17T02:19:54Z)