Power Constrained Autotuning using Graph Neural Networks
- URL: http://arxiv.org/abs/2302.11467v1
- Date: Wed, 22 Feb 2023 16:06:00 GMT
- Title: Power Constrained Autotuning using Graph Neural Networks
- Authors: Akash Dutta, Jee Choi, Ali Jannesari
- Abstract summary: We propose a novel Graph Neural Network based auto-tuning approach to improve the performance, power, and energy efficiency of scientific applications on modern processors.
Our approach identifies OpenMP configurations at different power constraints that yield a geometric mean performance improvement of more than $25\%$ and $13\%$ over the default OpenMP configuration.
- Score: 1.7188280334580197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multi and many-core processors have led to significant
improvements in the performance of scientific computing applications. However,
the addition of a large number of complex cores has also increased the overall
power consumption, and power has become a first-order design constraint in
modern processors. While we can limit power consumption by simply applying
software-based power constraints, applying them blindly will lead to
non-trivial performance degradation. To address the challenge of improving the
performance, power, and energy efficiency of scientific applications on modern
multi-core processors, we propose a novel Graph Neural Network based
auto-tuning approach that (i) optimizes runtime performance at pre-defined
power constraints, and (ii) simultaneously optimizes for runtime performance
and energy efficiency by minimizing the energy-delay product. The key idea
behind this approach lies in modeling parallel code regions as flow-aware code
graphs to capture both semantic and structural code features. We demonstrate
the efficacy of our approach by conducting an extensive evaluation on $30$
benchmarks and proxy-/mini-applications with $68$ OpenMP code regions. Our
approach identifies OpenMP configurations at different power constraints that
yield a geometric mean performance improvement of more than $25\%$ and $13\%$
over the default OpenMP configuration on a $32$-core Skylake and a $16$-core
Haswell processor, respectively. In addition, when we optimize for the
energy-delay product, the OpenMP configurations selected by our auto-tuner
demonstrate both performance improvement of $21\%$ and $11\%$ and energy
reduction of $29\%$ and $18\%$ over the default OpenMP configuration at Thermal
Design Power for the same Skylake and Haswell processors, respectively.
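To make the tuning objective concrete: the paper searches for OpenMP configurations under a software power cap and, in one mode, minimizes the energy-delay product (EDP = energy x runtime). The sketch below is a minimal brute-force illustration of that objective, not the paper's GNN-based tuner; the RAPL sysfs paths, the `./benchmark` command, and the small configuration space are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's tuner): apply a software power
# cap via the Linux powercap/RAPL sysfs interface, sweep a few OpenMP
# configurations, and keep the one with the lowest energy-delay product.
import itertools
import os
import subprocess
import time

RAPL = "/sys/class/powercap/intel-rapl:0"             # package-0 domain (assumed path)
POWER_LIMIT = f"{RAPL}/constraint_0_power_limit_uw"   # long-term power limit, microwatts
ENERGY_UJ = f"{RAPL}/energy_uj"                       # cumulative energy counter, microjoules


def set_power_cap(watts: float) -> None:
    """Apply a software power constraint (requires root privileges)."""
    with open(POWER_LIMIT, "w") as f:
        f.write(str(int(watts * 1e6)))


def read_energy_uj() -> int:
    with open(ENERGY_UJ) as f:
        return int(f.read())


def run_config(cmd, threads: int, schedule: str):
    """Run one OpenMP configuration; return (runtime in s, energy in J)."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads), OMP_SCHEDULE=schedule)
    e0, t0 = read_energy_uj(), time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    t1, e1 = time.perf_counter(), read_energy_uj()
    return t1 - t0, (e1 - e0) / 1e6                   # ignores counter wrap-around


if __name__ == "__main__":
    set_power_cap(120.0)                              # e.g. cap the socket at 120 W
    best = None
    for threads, schedule in itertools.product((8, 16, 32), ("static", "dynamic,16")):
        runtime, energy = run_config(["./benchmark"], threads, schedule)
        edp = energy * runtime                        # energy-delay product
        if best is None or edp < best[0]:
            best = (edp, threads, schedule)
    print("Best EDP configuration (edp, threads, schedule):", best)
```

In the paper, this exhaustive measurement loop is what the GNN replaces: flow-aware graphs of the code regions let the model predict good configurations without measuring every candidate.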
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs).
The algorithm is especially designed for high-throughput ab initio molecular dynamics simulations of small and medium-sized molecules (10-100 atoms).
arXiv Detail & Related papers (2024-07-29T00:14:10Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% reduction in computation, but poses challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is indeed effective, with a normalized RMSE ranging from as low as 0.004 to at most 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo Matching on FPGA Platform [21.435663827158564]
We propose an energy-efficient architecture for real-time ELAS-based stereo matching on FPGA platform.
Our FPGA realization achieves up to 38.4x and 3.32x frame rate improvement, and up to 27.1x and 1.13x energy efficiency improvement, respectively.
arXiv Detail & Related papers (2021-04-11T21:22:54Z)
- Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers [5.4352987210173955]
This paper aims at increasing the intelligence of the software toolchain so that modern architectures are exploited in the best possible way.
In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, that leads to minimum energy consumption.
Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation.
arXiv Detail & Related papers (2020-12-12T15:12:03Z)
- Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variational hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate-Scale Quantum devices.
We propose a strategy for such ansatze used in variational quantum algorithms, which we call "Parameter-Efficient Circuit Training" (PECT).
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.