Searching CUDA code autotuning spaces with hardware performance
counters: data from benchmarks running on various GPU architectures
- URL: http://arxiv.org/abs/2102.05299v1
- Date: Wed, 10 Feb 2021 07:51:09 GMT
- Title: Searching CUDA code autotuning spaces with hardware performance
counters: data from benchmarks running on various GPU architectures
- Authors: Jiří Filipovič, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, and
Filip Petrovič
- Abstract summary: We develop benchmarks that take into account performance-relevant source-code parameters and reach near peak performance on various GPU architectures.
With our framework Kernel Tuning Toolkit, we measured times and hardware performance counters on several GPUs for the complete tuning spaces of five benchmarks.
We also describe in detail the scripts we used for robust evaluation of our searcher and for comparison to other searchers.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We have developed several autotuning benchmarks in CUDA that take into
account performance-relevant source-code parameters and reach near
peak-performance on various GPU architectures. We have used them during the
development and evaluation of a novel search method for tuning spaces proposed
in [1]. With our framework Kernel Tuning Toolkit, freely available on GitHub,
we measured computation times and hardware performance counters on several GPUs
for the complete tuning spaces of five benchmarks. These data, which we provide
here, might benefit research on search algorithms for the tuning spaces of GPU
codes, or research on the relation between applied code optimizations, hardware
performance counters, and GPU kernels' performance.
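The sketch below only illustrates what exhaustively measuring a tuning space
involves: it enumerates every combination of a few made-up tuning parameters
and records a time and a counter value for each configuration. The parameter
names, the measurement stub, and the CSV layout are assumptions for the
example, not the Kernel Tuning Toolkit API.

```python
# Illustrative sketch: enumerate a complete tuning space and log one row per
# configuration. The parameters, the measurement stub, and the CSV columns
# are hypothetical; a real harness would compile and run the CUDA kernel and
# collect hardware performance counters for each configuration.
import csv
import itertools

tuning_parameters = {            # made-up tuning parameters of a GPU kernel
    "BLOCK_SIZE": [32, 64, 128, 256],
    "TILE_DIM": [1, 2, 4],
    "UNROLL": [0, 1],
}

def measure_configuration(config):
    """Stand-in for compiling and running the kernel with these parameter
    values; returns placeholder time and counter values."""
    return {"time_ms": float("nan"), "dram_read_transactions": float("nan")}

with open("tuning_space.csv", "w", newline="") as f:
    writer = None
    names = list(tuning_parameters)
    for values in itertools.product(*tuning_parameters.values()):
        config = dict(zip(names, values))
        row = {**config, **measure_configuration(config)}
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
```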
Moreover, we describe in detail the scripts we used for robust evaluation of
our searcher and for comparison to other searchers. In particular, the script
that simulates the tuning, i.e., replaces the time-demanding compilation and
execution of the tuned kernels with a quick lookup of the computation time in
our measured data, makes it possible to inspect the convergence of the tuning
search over a large number of experiments. These scripts, freely available
with our other codes, make it easier to experiment with search algorithms and
compare them in a robust way.
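As a rough illustration of the simulated-tuning idea, the sketch below replaces
kernel compilation and execution with a lookup into pre-measured data, so the
convergence of a (here, random) search can be replayed cheaply. The file name
and column names are assumptions; the authors' actual scripts are available
with their code.

```python
# Minimal sketch of simulated tuning: the searcher reads computation times
# from pre-measured complete-space data instead of running kernels.
# File and column names below are assumptions for the example.
import csv
import random

def load_measured_space(path, param_names, time_column="time_ms"):
    """Map each tuning configuration (tuple of parameter values) to its
    measured kernel time."""
    space = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            space[tuple(row[p] for p in param_names)] = float(row[time_column])
    return space

def simulate_random_search(space, evaluations=100, seed=0):
    """Random search over the pre-measured space; returns the best time seen
    after each simulated kernel evaluation (a convergence curve)."""
    rng = random.Random(seed)
    configs = list(space)
    best, curve = float("inf"), []
    for _ in range(evaluations):
        best = min(best, space[rng.choice(configs)])
        curve.append(best)
    return curve

# Example usage (assumed file and parameter names):
# space = load_measured_space("gemm_tuning_space.csv", ["BLOCK_SIZE", "TILE_DIM"])
# convergence = simulate_random_search(space, evaluations=500)
```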
During our research, we generated models for predicting values of performance
counters from values of tuning parameters of our benchmarks. Here, we provide
the models themselves and describe the scripts we implemented for their
training. These data might benefit researchers who want to reproduce or build
on our research.
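For illustration only, such a model can be fit as a multi-output regression
from tuning-parameter values to performance-counter values. The regressor
choice, file name, and column names below are assumptions; the authors provide
their own trained models and training scripts.

```python
# Illustrative sketch: predict hardware performance counters from tuning
# parameters with an off-the-shelf regressor. File and column names are
# assumed; this is not the authors' training script.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("gemm_tuning_space.csv")                     # assumed file
param_cols = ["BLOCK_SIZE", "TILE_DIM", "UNROLL"]             # assumed inputs
counter_cols = ["dram_read_transactions", "inst_executed"]    # assumed outputs

X_train, X_test, y_train, y_test = train_test_split(
    df[param_cols], df[counter_cols], test_size=0.2, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out configurations:", model.score(X_test, y_test))
```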
Related papers
- Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units (GPUs) is lacking.
We show that our new method outperforms the other two and then present it in the GpGpU R package.
arXiv Detail & Related papers (2024-07-03T01:24:44Z)
- SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation [0.0]
Large language models (LLMs) have become a significant workload since their appearance.
They are also computationally expensive as they have billions of parameters and are trained with massive amounts of data.
Recent works have developed dedicated kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible.
arXiv Detail & Related papers (2024-03-25T15:26:50Z)
- Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies [0.0]
The study introduces an analytical model-driven tuning methodology and a Machine Learning (ML)-based tuning methodology.
We evaluate the performance of the two tuning methodologies for different parallel prefix implementations of the BPLG library in an NVIDIA Jetson system.
arXiv Detail & Related papers (2023-10-24T22:09:03Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around the TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- Using hardware performance counters to speed up autotuning convergence on GPUs [0.0]
We introduce a novel method for searching tuning spaces.
The method takes advantage of collecting hardware performance counters during empirical tuning.
We experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics.
arXiv Detail & Related papers (2021-02-10T07:42:39Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.