Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and
Analytical Model-driven Tuning Methodologies
- URL: http://arxiv.org/abs/2310.16214v1
- Date: Tue, 24 Oct 2023 22:09:03 GMT
- Title: Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and
Analytical Model-driven Tuning Methodologies
- Authors: Adrian Perez Dieguez, Margarita Amor Lopez
- Abstract summary: The study introduces an analytical model-driven tuning methodology and a Machine Learning (ML)-based tuning methodology.
We evaluate the performance of the two tuning methodologies for different parallel prefix implementations of the BPLG library on an NVIDIA Jetson system.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPU-embedded systems have gained popularity across various domains due to
their efficient power consumption. However, in order to meet the demands of
real-time or time-consuming applications running on these systems, it is
crucial for them to be tuned to exhibit high performance. This paper addresses
the issue by developing and comparing two tuning methodologies on GPU-embedded
systems, and also provides performance insights for developers and researchers
seeking to optimize applications running on these architectures. We focus on
parallel prefix operations, such as FFT, scan primitives, and tridiagonal
system solvers, which are performance-critical components in many applications.
The study introduces an analytical model-driven tuning methodology and a
Machine Learning (ML)-based tuning methodology. We evaluate the performance of
the two tuning methodologies for different parallel prefix implementations of
the BPLG library on an NVIDIA Jetson system, and compare their performance with
that achieved through an exhaustive search. The findings shed light on the
best strategies for handling the open challenge of performance portability for
major computational patterns among server and embedded devices, providing
practical guidance for offline and online tuning. We also address the existing
gap in performance studies for parallel computational patterns in GPU-embedded
systems by comparing the BPLG performance against other state-of-the-art
libraries, including cuSPARSE, CUB, and cuFFT.
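For readers unfamiliar with the scan primitives mentioned in the abstract, the following is a minimal sketch of an inclusive parallel prefix (scan) using the well-known Hillis-Steele doubling pattern. It is not taken from the paper or from the BPLG library; the function name `inclusive_scan` is hypothetical, and the per-round list comprehension only emulates on the CPU what independent GPU threads would compute.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def inclusive_scan(values: Sequence[T], op: Callable[[T, T], T]) -> List[T]:
    """Hillis-Steele inclusive scan (illustrative sketch, not BPLG code).

    Runs about log2(n) rounds. In each round, element i is combined with
    the element `stride` positions to its left, and the stride doubles.
    `op` must be associative; it does not need to be commutative.
    """
    result = list(values)
    stride = 1
    while stride < len(result):
        # On a GPU, every index i would be handled by its own thread.
        # Here the round is emulated by building a new buffer from the
        # previous one, so no element is read after being overwritten.
        result = [
            op(result[i - stride], result[i]) if i >= stride else result[i]
            for i in range(len(result))
        ]
        stride *= 2
    return result


if __name__ == "__main__":
    print(inclusive_scan([1, 2, 3, 4, 5], lambda a, b: a + b))
```

Parameters such as how many elements each thread handles are the kind of choice a tuning methodology would select; this sketch fixes them implicitly at one element per emulated thread.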
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference [2.2231908139555734]
We propose a general performance modeling methodology and workload analysis of distributed LLM training and inference.
We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA).
arXiv Detail & Related papers (2024-07-19T19:49:05Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with a normalized RMSE between 0.004 and 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Hierarchical Roofline Performance Analysis for Deep Learning Applications [0.06999740786886534]
This paper presents a practical methodology for collecting performance data necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs.
It discusses the extension of the Empirical Roofline Toolkit for broader support of a range of data precisions and Tensor Core support, and introduces an Nsight Compute-based method to accurately collect application performance information.
arXiv Detail & Related papers (2020-09-11T07:16:55Z)
- Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach [16.702537371391053]
This article presents an automatic approach to derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures.
Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration.
Compared to the single-stream version, our approach achieves a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively.
arXiv Detail & Related papers (2020-03-05T21:18:21Z)