Searching CUDA code autotuning spaces with hardware performance
counters: data from benchmarks running on various GPU architectures
- URL: http://arxiv.org/abs/2102.05299v1
- Date: Wed, 10 Feb 2021 07:51:09 GMT
- Title: Searching CUDA code autotuning spaces with hardware performance
counters: data from benchmarks running on various GPU architectures
- Authors: Jiří Filipovič, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, and
Filip Petrovič
- Abstract summary: We develop benchmarks that take into account performance-relevant source-code parameters and reach near peak performance on various GPU architectures.
With our framework Kernel Tuning Toolkit, we measured times and hardware performance counters on several GPUs for the complete tuning spaces of five benchmarks.
We also describe in detail the scripts we used for robust evaluation of our searcher and for comparison to other searchers.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We have developed several autotuning benchmarks in CUDA that take into
account performance-relevant source-code parameters and reach near
peak-performance on various GPU architectures. We have used them during the
development and evaluation of a novel search method for tuning spaces proposed
in [1]. With our framework Kernel Tuning Toolkit, freely available on GitHub,
we measured computation times and hardware performance counters on several GPUs
for the complete tuning spaces of five benchmarks. These data, which we provide
here, might benefit research on search algorithms for the tuning spaces of GPU
codes, or research on the relation between applied code optimizations, hardware
performance counters, and GPU kernels' performance.
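The sketch below only illustrates what exhaustively measuring a tuning space
involves: it enumerates every combination of a few made-up tuning parameters
and records a time and a counter value for each configuration. The parameter
names, the measurement stub, and the CSV layout are assumptions for the
example, not the Kernel Tuning Toolkit API.

```python
# Illustrative sketch: enumerate a complete tuning space and log one row per
# configuration. The parameters, the measurement stub, and the CSV columns
# are hypothetical; a real harness would compile and run the CUDA kernel and
# collect hardware performance counters for each configuration.
import csv
import itertools

tuning_parameters = {            # made-up tuning parameters of a GPU kernel
    "BLOCK_SIZE": [32, 64, 128, 256],
    "TILE_DIM": [1, 2, 4],
    "UNROLL": [0, 1],
}

def measure_configuration(config):
    """Stand-in for compiling and running the kernel with these parameter
    values; returns placeholder time and counter values."""
    return {"time_ms": float("nan"), "dram_read_transactions": float("nan")}

with open("tuning_space.csv", "w", newline="") as f:
    writer = None
    names = list(tuning_parameters)
    for values in itertools.product(*tuning_parameters.values()):
        config = dict(zip(names, values))
        row = {**config, **measure_configuration(config)}
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
```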
Moreover, we describe in detail the scripts we used for robust evaluation of
our searcher and for comparison to other searchers. In particular, the script
that simulates the tuning, i.e., replaces the time-demanding compilation and
execution of the tuned kernels with a quick lookup of the computation time in
our measured data, makes it possible to inspect the convergence of the tuning
search over a large number of experiments. These scripts, freely available
with our other codes, make it easier to experiment with search algorithms and
compare them in a robust way.
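As a rough illustration of the simulated-tuning idea, the sketch below replaces
kernel compilation and execution with a lookup into pre-measured data, so the
convergence of a (here, random) search can be replayed cheaply. The file name
and column names are assumptions; the authors' actual scripts are available
with their code.

```python
# Minimal sketch of simulated tuning: the searcher reads computation times
# from pre-measured complete-space data instead of running kernels.
# File and column names below are assumptions for the example.
import csv
import random

def load_measured_space(path, param_names, time_column="time_ms"):
    """Map each tuning configuration (tuple of parameter values) to its
    measured kernel time."""
    space = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            space[tuple(row[p] for p in param_names)] = float(row[time_column])
    return space

def simulate_random_search(space, evaluations=100, seed=0):
    """Random search over the pre-measured space; returns the best time seen
    after each simulated kernel evaluation (a convergence curve)."""
    rng = random.Random(seed)
    configs = list(space)
    best, curve = float("inf"), []
    for _ in range(evaluations):
        best = min(best, space[rng.choice(configs)])
        curve.append(best)
    return curve

# Example usage (assumed file and parameter names):
# space = load_measured_space("gemm_tuning_space.csv", ["BLOCK_SIZE", "TILE_DIM"])
# convergence = simulate_random_search(space, evaluations=500)
```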
During our research, we generated models for predicting values of performance
counters from values of tuning parameters of our benchmarks. Here, we provide
the models themselves and describe the scripts we implemented for their
training. These data might benefit researchers who want to reproduce or build
on our research.
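For illustration only, such a model can be fit as a multi-output regression
from tuning-parameter values to performance-counter values. The regressor
choice, file name, and column names below are assumptions; the authors provide
their own trained models and training scripts.

```python
# Illustrative sketch: predict hardware performance counters from tuning
# parameters with an off-the-shelf regressor. File and column names are
# assumed; this is not the authors' training script.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("gemm_tuning_space.csv")                     # assumed file
param_cols = ["BLOCK_SIZE", "TILE_DIM", "UNROLL"]             # assumed inputs
counter_cols = ["dram_read_transactions", "inst_executed"]    # assumed outputs

X_train, X_test, y_train, y_test = train_test_split(
    df[param_cols], df[counter_cols], test_size=0.2, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out configurations:", model.score(X_test, y_test))
```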
Related papers
- Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units (GPUs) is lacking.
We show that our new method outperforms the other two and then present it in the GpGpU R package.
arXiv Detail & Related papers (2024-07-03T01:24:44Z)
- SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation [0.0]
Large language models (LLMs) have become a significant workload since their appearance.
They are also computationally expensive as they have billions of parameters and are trained with massive amounts of data.
Recent works have developed dedicated kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible.
arXiv Detail & Related papers (2024-03-25T15:26:50Z)
- Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies [0.0]
The study introduces an analytical model-driven tuning methodology and a Machine Learning (ML)-based tuning methodology.
We evaluate the performance of the two tuning methodologies for different parallel prefix implementations of the BPLG library in an NVIDIA Jetson system.
arXiv Detail & Related papers (2023-10-24T22:09:03Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around the TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- Using hardware performance counters to speed up autotuning convergence on GPUs [0.0]
We introduce a novel method for searching tuning spaces.
The method takes advantage of collecting hardware performance counters during empirical tuning.
We experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics.
arXiv Detail & Related papers (2021-02-10T07:42:39Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.