Machine Learning-Driven Adaptive OpenMP For Portable Performance on
Heterogeneous Systems
- URL: http://arxiv.org/abs/2303.08873v1
- Date: Wed, 15 Mar 2023 18:37:18 GMT
- Title: Machine Learning-Driven Adaptive OpenMP For Portable Performance on
Heterogeneous Systems
- Authors: Giorgis Georgakoudis, Konstantinos Parasyris, Chunhua Liao, David
Beckingsale, Todd Gamblin, Bronis de Supinski
- Abstract summary: Adapting a program to a new heterogeneous platform is laborious and requires developers to manually explore a vast space of execution parameters.
This paper proposes extensions to OpenMP for autonomous, machine learning-driven adaptation.
Our solution includes a set of novel language constructs, compiler transformations, and runtime support.
- Score: 1.885335997132172
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Heterogeneity has become a mainstream architecture design choice for building
High Performance Computing systems. However, heterogeneity poses significant
challenges for achieving performance portability of execution. Adapting a
program to a new heterogeneous platform is laborious and requires developers to
manually explore a vast space of execution parameters. To address those
challenges, this paper proposes new extensions to OpenMP for autonomous,
machine learning-driven adaptation.
Our solution includes a set of novel language constructs, compiler
transformations, and runtime support. We propose a producer-consumer pattern to
flexibly define multiple, different variants of OpenMP code regions to enable
adaptation. Those regions are transparently profiled at runtime to autonomously
learn machine learning models that dynamically select the fastest variant. Our
approach significantly reduces the effort of programming adaptive applications
on heterogeneous architectures by leveraging machine learning techniques and
the code generation capabilities of OpenMP compilation.
Using a complete reference implementation in Clang/LLVM, we evaluate three
use cases of adaptive CPU-GPU execution. Experiments with HPC proxy
applications and benchmarks demonstrate that the proposed adaptive OpenMP
extensions automatically choose the best-performing code variants for various
adaptation possibilities on several different heterogeneous platforms of CPUs
and GPUs.
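
To make the adaptation problem concrete, here is a minimal C sketch of the kind of variant space the paper targets. It uses only standard OpenMP directives (parallel for on the host, target offloading for the GPU), not the paper's proposed constructs, and a fixed size threshold stands in for the runtime-learned model that would select the fastest variant; all names are illustrative.

    /* Minimal sketch (standard OpenMP, NOT the paper's proposed syntax):
     * two hand-written variants of the same SAXPY kernel, one for the host
     * CPU and one offloaded to a GPU. The adaptive extensions described
     * above aim to generate, profile, and select such variants
     * automatically; the size threshold below is only a placeholder for
     * the learned selection model. */
    #include <stdio.h>

    #define N (1 << 20)

    static float x[N], y[N];

    /* CPU variant: multithreaded loop on the host. */
    static void saxpy_cpu(int n, float a, const float *xp, float *yp) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            yp[i] = a * xp[i] + yp[i];
    }

    /* GPU variant: the same loop offloaded with OpenMP target directives. */
    static void saxpy_gpu(int n, float a, const float *xp, float *yp) {
        #pragma omp target teams distribute parallel for \
                map(to: xp[0:n]) map(tofrom: yp[0:n])
        for (int i = 0; i < n; i++)
            yp[i] = a * xp[i] + yp[i];
    }

    int main(void) {
        int n = N;
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Placeholder policy: a fixed threshold stands in for the model that
         * the proposed extensions would learn from runtime profiles. */
        if (n >= (1 << 16))
            saxpy_gpu(n, 2.0f, x, y);
        else
            saxpy_cpu(n, 2.0f, x, y);

        printf("y[0] = %f\n", y[0]);
        return 0;
    }

Build with, e.g., clang -fopenmp (plus an offloading target flag for the GPU path). Note that OpenMP 5.x already provides metadirective and declare variant for trait-based variant selection; the extensions proposed in this paper go further by driving the selection with models learned from transparent runtime profiling.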
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- A Method for Efficient Heterogeneous Parallel Compilation: A Cryptography Case Study [8.06660833012594]
This paper introduces a novel MLIR-based dialect, named hyper, designed to optimize data management and parallel computation across diverse hardware architectures.
We present HETOCompiler, a cryptography-focused compiler prototype that implements multiple hash algorithms and enables their execution on heterogeneous systems.
arXiv Detail & Related papers (2024-07-12T15:12:51Z)
- PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler [1.6673953344957533]
We introduce a new polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations.
PolyTOPS has been used with isl and CLooG as code generators and has been integrated into the MindSpore deep learning compiler.
arXiv Detail & Related papers (2024-01-12T16:11:27Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion (a conceptual C sketch of this decomposition appears after this list).
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Performance Optimization using Multimodal Modeling and Heterogeneous GNN [1.304892050913381]
We propose a technique for tuning parallel code regions that is general enough to be adapted to multiple tasks.
In this paper, we analyze IR-based programming models to make task-specific performance optimizations.
Our experiments show that this multimodal learning-based approach outperforms the state of the art in all experiments.
arXiv Detail & Related papers (2023-04-25T04:27:43Z)
- ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with a normalized RMSE ranging from as low as 0.004 to at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Towards making the most of NLP-based device mapping optimization for OpenCL kernels [5.6596607119831575]
We extend the work of Cummins et al., namely Deeptune, which tackles the problem of optimal device selection (CPU or GPU) for accelerated OpenCL kernels.
We propose four different models that provide enhanced contextual information of source codes.
Experimental results show that our proposed methodology surpasses that of Cummins et al., providing up to 4% improvement in prediction accuracy.
arXiv Detail & Related papers (2022-08-30T10:20:55Z)
- Learning to Superoptimize Real-world Programs [79.4140991035247]
We propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models.
We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly.
arXiv Detail & Related papers (2021-09-28T05:33:21Z)
- Enabling Retargetable Optimizing Compilers for Quantum Accelerators via a Multi-Level Intermediate Representation [78.8942067357231]
We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler.
We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax.
Our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparable standalone quantum language compilers.
arXiv Detail & Related papers (2021-09-01T17:29:47Z)
- A Reinforcement Learning Environment for Polyhedral Optimizations [68.8204255655161]
We propose a shape-agnostic formulation for the space of legal transformations in the polyhedral model as a Markov Decision Process (MDP).
Instead of using transformations, the formulation is based on an abstract space of possible schedules.
Our generic MDP formulation enables using reinforcement learning to learn optimization policies over a wide range of loops.
arXiv Detail & Related papers (2021-04-28T12:41:52Z)
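
For the "Harnessing Deep Learning and HPC Kernels" entry above, the following conceptual C sketch illustrates the two-step decomposition it describes: a small computational primitive and the logical loops written separately around it. The names and the plain blocked matrix-multiply micro-kernel are illustrative only; they are not that paper's framework or any real TPP library API, where the primitive would typically be a specialized, JIT-generated routine and the outer loops would be expressed declaratively.

    /* Conceptual sketch of the "primitive core + outer logical loops" split.
     * Step 1: the computational core as a small primitive (one BLK x BLK
     * tile of C += A * B). Step 2: the logical loops around it, kept
     * separate from the primitive. */
    #include <stdio.h>

    #define BLK 32   /* tile size handled by the primitive */

    /* Step 1: the computational core on a single tile. */
    static void gemm_primitive(const float *A, const float *B, float *C, int ld) {
        for (int i = 0; i < BLK; i++)
            for (int k = 0; k < BLK; k++)
                for (int j = 0; j < BLK; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
    }

    /* Step 2: the logical loops over tiles, written around the primitive. */
    static void gemm(const float *A, const float *B, float *C, int n) {
        for (int ib = 0; ib < n; ib += BLK)
            for (int kb = 0; kb < n; kb += BLK)
                for (int jb = 0; jb < n; jb += BLK)
                    gemm_primitive(&A[ib * n + kb], &B[kb * n + jb],
                                   &C[ib * n + jb], n);
    }

    int main(void) {
        enum { N = 128 };                 /* N must be a multiple of BLK */
        static float A[N * N], B[N * N], C[N * N];
        for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 1.0f; C[i] = 0.0f; }
        gemm(A, B, C, N);
        printf("C[0] = %.1f (expected %d.0)\n", C[0], N);
        return 0;
    }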
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.