ParaGraph: Weighted Graph Representation for Performance Optimization of
HPC Kernels
- URL: http://arxiv.org/abs/2304.03487v1
- Date: Fri, 7 Apr 2023 05:52:59 GMT
- Title: ParaGraph: Weighted Graph Representation for Performance Optimization of
HPC Kernels
- Authors: Ali TehraniJamsaz, Alok Mishra, Akash Dutta, Abid M. Malik, Barbara
Chapman, Ali Jannesari
- Abstract summary: We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with a normalized RMSE between 0.004 and 0.01 in its runtime predictions.
- Score: 1.304892050913381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: GPU-based HPC clusters are attracting more scientific application developers
due to their extensive parallelism and energy efficiency. In order to achieve
portability among a variety of multi/many core architectures, a popular choice
for an application developer is to utilize directive-based parallel programming
models, such as OpenMP. However, even with OpenMP, the developer must choose
from among many strategies for exploiting a GPU or a CPU. Recently, Machine
Learning (ML) approaches have brought significant advances in the optimizations
of HPC applications. To this end, several ways have been proposed to represent
application characteristics for ML models. However, the available techniques
fail to capture features that are crucial for exposing parallelism. In this
paper, we introduce a new graph-based program representation for parallel
applications that extends the Abstract Syntax Tree to represent control and
data flow information. The originality of this work lies in the addition of new
edges exploiting the implicit ordering and parent-child relationships in ASTs,
as well as the introduction of edge weights to account for loop and condition
information. We evaluate our proposed representation by training a Graph Neural
Network (GNN) to predict the runtime of an OpenMP code region across CPUs and
GPUs. Various transformations utilizing collapse and data transfer between the
CPU and GPU are used to construct the dataset. The predicted runtime of the
model is used to determine which transformation provides the best performance.
Results show that our approach is effective, with a normalized RMSE between
0.004 and 0.01 in its runtime predictions.
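
The abstract describes the representation only at a high level, so the following is a minimal Python sketch of the general idea, not the authors' implementation. It assumes a toy hand-built AST (the `Node` class, the node kinds, and the `trip_count` field are all hypothetical), adds `child` edges for parent-child relationships and `next_sibling` edges for the implicit ordering of children, and weights edges under two illustrative assumptions: a loop body executes `trip_count` times, and each branch of a conditional is taken half of the time. The `pick_best` helper mirrors the last step of the abstract, choosing the transformation with the lowest predicted runtime; the variant names and runtimes in the usage example are placeholders, not results from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical toy AST node (not the Clang AST used for real C/OpenMP code)."""
    kind: str                                   # e.g. "For", "If", "Assign"
    children: list = field(default_factory=list)
    trip_count: int = 1                         # only meaningful for "For" nodes


def build_weighted_graph(root):
    """Return (nodes, edges); each edge is (src_id, dst_id, edge_type, weight)."""
    nodes, edges = [], []

    def visit(node, weight):
        node_id = len(nodes)
        nodes.append(node.kind)
        # Assumed weighting scheme: loops multiply the weight by the trip
        # count, conditionals halve it, everything else passes it through.
        if node.kind == "For":
            child_weight = weight * node.trip_count
        elif node.kind == "If":
            child_weight = weight * 0.5
        else:
            child_weight = weight
        prev_id = None
        for child in node.children:
            child_id = visit(child, child_weight)
            edges.append((node_id, child_id, "child", child_weight))
            if prev_id is not None:
                # Extra edge that makes the left-to-right ordering explicit.
                edges.append((prev_id, child_id, "next_sibling", child_weight))
            prev_id = child_id
        return node_id

    visit(root, 1.0)
    return nodes, edges


def pick_best(predicted_runtimes):
    """Return the variant name with the smallest predicted runtime."""
    return min(predicted_runtimes, key=predicted_runtimes.get)


# Usage: a loop of 100 iterations containing an assignment and a conditional.
tree = Node("For", trip_count=100, children=[
    Node("Assign"),
    Node("If", children=[Node("Assign"), Node("Call")]),
])
nodes, edges = build_weighted_graph(tree)

# Placeholder predictions (in seconds) for two hypothetical code variants.
best = pick_best({"cpu_parallel_for": 0.8, "gpu_target_collapse": 0.3})
```

In a real pipeline the node and edge lists would be converted to tensors and passed to a GNN library; that step is omitted here because the source does not specify the model architecture.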
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units (GPU) is lacking.
We show that our new method outperforms the other two and then present it in the GpGpU R package.
arXiv Detail & Related papers (2024-07-03T01:24:44Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Performance Optimization using Multimodal Modeling and Heterogeneous GNN [1.304892050913381]
We propose a technique for tuning parallel code regions that is general enough to be adapted to multiple tasks.
In this paper, we analyze IR-based programming models to make task-specific performance optimizations.
Our experiments show that this multimodal learning based approach outperforms the state-of-the-art in all experiments.
arXiv Detail & Related papers (2023-04-25T04:27:43Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- ML-driven Hardware Cost Model for MLIR [1.2987894327817158]
We develop a machine learning-based cost model for high-level MLIR.
By treating the incoming MLIR as text input, a la NLP models, we can apply well-known techniques from modern NLP research.
We show that these models can provide reasonably good estimates with low error bounds for various hardware characteristics of interest.
arXiv Detail & Related papers (2023-02-14T11:32:47Z)
- Scalable Graph Convolutional Network Training on Distributed-Memory Systems [5.169989177779801]
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges.
We propose a highly parallel training algorithm that scales to large processor counts.
arXiv Detail & Related papers (2022-12-09T17:51:13Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
- Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits [99.59941892183454]
We propose Einsum Networks (EiNets), a novel implementation design for PCs.
At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum-operation.
We show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation.
arXiv Detail & Related papers (2020-04-13T23:09:15Z)
- ProGraML: Graph-based Deep Learning for Program Optimization and Analysis [16.520971531754018]
We introduce ProGraML, a graph-based program representation for machine learning.
ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches.
We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.
arXiv Detail & Related papers (2020-03-23T20:27:00Z)