Using hardware performance counters to speed up autotuning convergence on GPUs
- URL: http://arxiv.org/abs/2102.05297v1
- Date: Wed, 10 Feb 2021 07:42:39 GMT
- Title: Using hardware performance counters to speed up autotuning convergence on GPUs
- Authors: Jiří Filipovič and Jana Hozzová and Amin Nezarat and Jaroslav Oľha and Filip Petrovič
- Abstract summary: We introduce a novel method for searching tuning spaces.
The method takes advantage of collecting hardware performance counters during empirical tuning.
We experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, GPU accelerators are commonly used to speed up general-purpose
computing tasks on a variety of hardware. However, due to the diversity of GPU
architectures and processed data, optimization of codes for a particular type
of hardware and specific data characteristics can be extremely challenging. The
autotuning of performance-relevant source-code parameters allows for automatic
optimization of applications and keeps their performance portable. Although the
autotuning process typically results in code speed-up, searching the tuning
space can bring unacceptable overhead if (i) the tuning space is vast and full
of poorly-performing implementations, or (ii) the autotuning process has to be
repeated frequently because of changes in processed data or migration to
different hardware.
In this paper, we introduce a novel method for searching tuning spaces. The
method takes advantage of collecting hardware performance counters (also known
as profiling counters) during empirical tuning. Those counters are used to
navigate the searching process towards faster implementations. The method
requires the tuning space to be sampled on any GPU. It builds a
problem-specific model, which can be used during autotuning on various, even
previously unseen inputs or GPUs. Using a set of five benchmarks, we
experimentally demonstrate that our method can speed up autotuning when an
application needs to be ported to different hardware or when it needs to
process data with different characteristics. We also compare our method to the
state of the art and show that it is superior in terms of the number of search
steps and typically outperforms other searches in terms of convergence time.
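
The core loop described in the abstract lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of counter-guided search, not the paper's implementation: the tuning parameters, the measure() stand-in, and the synthetic counters are all invented here so the script runs without a GPU, and the model chaining (parameters to predicted counters to predicted runtime) is a simplification of how the paper actually uses the counters.

```python
# Loose, hypothetical sketch of using profiling counters while searching a
# tuning space.  Not the paper's implementation: measure() and the counters
# below are synthetic stand-ins so the script runs without a GPU.
import itertools
import random

from sklearn.ensemble import RandomForestRegressor

random.seed(0)

# Tuning space: Cartesian product of a few source-code parameters.
space = [{"block_size": b, "tile": t, "unroll": u}
         for b, t, u in itertools.product((64, 128, 256, 512),
                                          (1, 2, 4, 8), (1, 2, 4))]

def measure(cfg):
    """Stand-in for one empirical GPU run: returns (runtime_ms, counters)."""
    counters = [1e6 / cfg["tile"],                     # e.g. DRAM reads
                min(1.0, cfg["block_size"] / 256),     # e.g. achieved occupancy
                cfg["unroll"] * cfg["tile"]]           # e.g. registers per thread
    runtime = 0.5 * counters[0] / 1e6 + 0.2 / counters[1] + 0.01 * counters[2]
    return runtime + random.gauss(0.0, 0.005), counters

def params(cfg):
    return [cfg["block_size"], cfg["tile"], cfg["unroll"]]

# 1) Offline: profile a random sample of the space (the expensive step that
#    the paper amortises by reusing the resulting model on other GPUs/inputs).
sample = random.sample(space, 16)
runs = [(cfg, *measure(cfg)) for cfg in sample]

counter_model = RandomForestRegressor(random_state=0).fit(
    [params(cfg) for cfg, _, ctrs in runs], [ctrs for _, _, ctrs in runs])
runtime_model = RandomForestRegressor(random_state=0).fit(
    [ctrs for _, _, ctrs in runs], [rt for _, rt, _ in runs])

# 2) Online: rank unmeasured configurations by predicted runtime
#    (parameters -> predicted counters -> predicted runtime), then verify
#    only the most promising candidates empirically.
candidates = [cfg for cfg in space if cfg not in sample]
ranked = sorted(candidates, key=lambda cfg: runtime_model.predict(
    counter_model.predict([params(cfg)]))[0])
best = min(ranked[:5], key=lambda cfg: measure(cfg)[0])
print("best configuration found:", best)
```

Routing the prediction through counters rather than fitting parameters directly to runtime mirrors the paper's motivation: counters carry hardware-level information, which is what lets the problem-specific model transfer to previously unseen inputs and GPUs.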
Related papers
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z) - Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - HPC Storage Service Autotuning Using Variational-Autoencoder-Guided Asynchronous Bayesian Optimization [3.153934519625761]
We develop a novel variational-autoencoder-guided asynchronous Bayesian optimization method to tune HPC storage service parameters.
We implement our approach within the DeepHyper open-source framework, and apply it to the autotuning of a high-energy physics workflow on Argonne's Theta supercomputer.
Our approach is on par with state-of-the-art autotuning frameworks in speed and outperforms them in resource utilization and parallelization capabilities.
arXiv Detail & Related papers (2022-10-03T10:12:57Z) - HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time and Robustness [1.165213554548421]
This work evaluates how invalid configurations affect the auto-tuning process and its underlying performance prediction model for the VTA hardware.
A validity-driven method for AutoTVM is developed, only requiring 41.6% of the necessary hardware measurements to find the best solution.
arXiv Detail & Related papers (2022-05-31T07:16:14Z) - MAPLE: Microprocessor A Priori for Latency Estimation [81.91509153539566]
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation.
arXiv Detail & Related papers (2021-11-30T03:52:15Z) - AutoTune: Controller Tuning for High-Speed Flight [117.69289575486246]
How sensitive are controllers to tuning when tracking high-speed maneuvers?
What algorithms can we use to automatically tune them?
We propose AutoTune, a sampling-based tuning algorithm specifically tailored to high-speed flight.
arXiv Detail & Related papers (2021-03-19T09:12:51Z) - Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures [0.0]
We develop benchmarks that take performance-relevant source-code parameters into account and reach near-peak performance on various GPU architectures.
With our framework, Kernel Tuning Toolkit, we measured runtimes and hardware performance counters on several GPUs for the complete tuning spaces of five benchmarks.
We describe the scripts we used for robust evaluation of our searcher and comparison to others in detail.
arXiv Detail & Related papers (2021-02-10T07:51:09Z) - Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization [0.6583716093321499]
Autotuning is an approach that explores a search space of possible implementations/configurations of a kernel or an application.
We develop an autotuning framework that leverages Bayesian optimization to explore the parameter search space (see the sketch after this list).
arXiv Detail & Related papers (2020-10-15T22:09:42Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but have so far been hard to apply to large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Latency-Aware Differentiable Neural Architecture Search [113.35689580508343]
Differentiable neural architecture search methods have become popular in recent years, mainly due to their low search costs and flexibility in designing the search space.
However, these methods have difficulty optimizing the network, so the searched network is often unfriendly to hardware.
This paper deals with the problem by adding a differentiable latency loss term to the optimization, so that the search process can trade off accuracy and latency with a balancing coefficient.
arXiv Detail & Related papers (2020-01-17T15:55:21Z)
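
As a side note to the Clang/Polly entry above, the following is a hedged sketch of what Bayesian-optimization-driven autotuning over a small discrete parameter space can look like. It is not that paper's framework: the parameter names are invented, and objective() is a synthetic stand-in for compiling a kernel with the chosen loop-optimization parameters and timing the result.

```python
# Hypothetical sketch of Bayesian-optimization-driven autotuning over a small
# discrete parameter space.  objective() is a synthetic stand-in for
# "compile with these loop-optimization parameters and time the binary".
import itertools

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Candidate configurations: a hypothetical tile size and unroll factor.
grid = np.array(list(itertools.product((16, 32, 64, 128), (1, 2, 4, 8))),
                dtype=float)

def objective(x):
    """Synthetic stand-in for compiling and timing one configuration."""
    tile, unroll = x
    return 1.0 / tile + 0.05 * abs(unroll - 4) + rng.normal(0.0, 0.002)

# Seed with a few random measurements, then iterate: fit a GP surrogate and
# measure the configuration with the highest expected improvement (EI).
idx = list(rng.choice(len(grid), size=4, replace=False))
y = [objective(grid[i]) for i in idx]

for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(grid[idx], y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = min(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    ei = np.nan_to_num(ei)
    ei[idx] = 0.0                     # never re-measure a known configuration
    nxt = int(np.argmax(ei))
    idx.append(nxt)
    y.append(objective(grid[nxt]))

print("best configuration:", grid[int(np.argmin(y))], "runtime:", min(y))
```

Each iteration fits a Gaussian-process surrogate to the configurations measured so far and spends the next measurement on the candidate with the highest expected improvement, which is how Bayesian optimization keeps the number of expensive empirical runs low.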