Power Constrained Autotuning using Graph Neural Networks
- URL: http://arxiv.org/abs/2302.11467v1
- Date: Wed, 22 Feb 2023 16:06:00 GMT
- Title: Power Constrained Autotuning using Graph Neural Networks
- Authors: Akash Dutta, Jee Choi, Ali Jannesari
- Abstract summary: We propose a novel Graph Neural Network based auto-tuning approach to improve the performance, power, and energy efficiency of scientific applications on modern processors.
Our approach identifies OpenMP configurations at different power constraints that yield a geometric mean performance improvement of more than $25\%$ and $13\%$ over the default OpenMP configuration.
- Score: 1.7188280334580197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multi and many-core processors have led to significant
improvements in the performance of scientific computing applications. However,
the addition of a large number of complex cores has also increased the overall
power consumption, and power has become a first-order design constraint in
modern processors. While we can limit power consumption by simply applying
software-based power constraints, applying them blindly will lead to
non-trivial performance degradation. To address the challenge of improving the
performance, power, and energy efficiency of scientific applications on modern
multi-core processors, we propose a novel Graph Neural Network based
auto-tuning approach that (i) optimizes runtime performance at pre-defined
power constraints, and (ii) simultaneously optimizes for runtime performance
and energy efficiency by minimizing the energy-delay product. The key idea
behind this approach lies in modeling parallel code regions as flow-aware code
graphs to capture both semantic and structural code features. We demonstrate
the efficacy of our approach by conducting an extensive evaluation on $30$
benchmarks and proxy-/mini-applications with $68$ OpenMP code regions. Our
approach identifies OpenMP configurations at different power constraints that
yield a geometric mean performance improvement of more than $25\%$ and $13\%$
over the default OpenMP configuration on a $32$-core Skylake and a $16$-core
Haswell processor, respectively. In addition, when we optimize for the
energy-delay product, the OpenMP configurations selected by our auto-tuner
demonstrate both performance improvement of $21\%$ and $11\%$ and energy
reduction of $29\%$ and $18\%$ over the default OpenMP configuration at Thermal
Design Power for the same Skylake and Haswell processors, respectively.
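To make the tuning objective concrete: the paper searches for OpenMP configurations under a software power cap and, in one mode, minimizes the energy-delay product (EDP = energy x runtime). The sketch below is a minimal brute-force illustration of that objective, not the paper's GNN-based tuner; the RAPL sysfs paths, the `./benchmark` command, and the small configuration space are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's tuner): apply a software power
# cap via the Linux powercap/RAPL sysfs interface, sweep a few OpenMP
# configurations, and keep the one with the lowest energy-delay product.
import itertools
import os
import subprocess
import time

RAPL = "/sys/class/powercap/intel-rapl:0"             # package-0 domain (assumed path)
POWER_LIMIT = f"{RAPL}/constraint_0_power_limit_uw"   # long-term power limit, microwatts
ENERGY_UJ = f"{RAPL}/energy_uj"                       # cumulative energy counter, microjoules


def set_power_cap(watts: float) -> None:
    """Apply a software power constraint (requires root privileges)."""
    with open(POWER_LIMIT, "w") as f:
        f.write(str(int(watts * 1e6)))


def read_energy_uj() -> int:
    with open(ENERGY_UJ) as f:
        return int(f.read())


def run_config(cmd, threads: int, schedule: str):
    """Run one OpenMP configuration; return (runtime in s, energy in J)."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads), OMP_SCHEDULE=schedule)
    e0, t0 = read_energy_uj(), time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    t1, e1 = time.perf_counter(), read_energy_uj()
    return t1 - t0, (e1 - e0) / 1e6                   # ignores counter wrap-around


if __name__ == "__main__":
    set_power_cap(120.0)                              # e.g. cap the socket at 120 W
    best = None
    for threads, schedule in itertools.product((8, 16, 32), ("static", "dynamic,16")):
        runtime, energy = run_config(["./benchmark"], threads, schedule)
        edp = energy * runtime                        # energy-delay product
        if best is None or edp < best[0]:
            best = (edp, threads, schedule)
    print("Best EDP configuration (edp, threads, schedule):", best)
```

In the paper, this exhaustive measurement loop is what the GNN replaces: flow-aware graphs of the code regions let the model predict good configurations without measuring every candidate.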
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is especially designed for high-throughput ab initio molecular dynamics simulations of small and medium-sized molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% reduction in computation, but poses challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is indeed effective, with a normalized RMSE ranging from as low as 0.004 to at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo Matching on FPGA Platform [21.435663827158564]
We propose an energy-efficient architecture for real-time ELAS-based stereo matching on FPGA platform.
Our FPGA realization achieves up to 38.4x and 3.32x frame rate improvement, and up to 27.1x and 1.13x energy efficiency improvement, respectively.
arXiv Detail & Related papers (2021-04-11T21:22:54Z)
- Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers [5.4352987210173955]
This paper aims at increasing the intelligence of the software toolchain so that modern architectures are exploited in the best possible way.
In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, that leads to minimum energy consumption.
Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation.
arXiv Detail & Related papers (2020-12-12T15:12:03Z)
- Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variational hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate-Scale Quantum devices.
We propose a strategy for such ansatze used in variational quantum algorithms, which we call "Parameter-Efficient Circuit Training" (PECT).
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.