Related papers: A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels

URL: http://arxiv.org/abs/2001.07104v3
Date: Wed, 30 Sep 2020 12:47:57 GMT
Title: A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels
Authors: Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger Fröning
Abstract summary: The model is built on random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC. Evaluating the model with cross-validation yields a median Mean Average Percentage Error (MAPE) of 8.86-52.00% for time prediction and 1.84-2.94% for power prediction across five different GPUs, while the latency of a single prediction varies between 15 and 108 milliseconds.
Score: 2.9853894456071077
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model that enables portable and fast predictions across different GPUs using only hardware-independent features. The model is built on random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC. Evaluating the model with cross-validation yields a median Mean Average Percentage Error (MAPE) of 8.86-52.00% for time prediction and 1.84-2.94% for power prediction across five different GPUs, while the latency of a single prediction varies between 15 and 108 milliseconds.
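For readers unfamiliar with the metric, the following is a minimal sketch of how a median MAPE could be computed. It assumes the usual mean absolute percentage error definition and assumes the median is taken over per-fold values; the abstract does not state either detail. The function name `mape` and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import median


def mape(measured, predicted):
    """Mean absolute percentage error, in percent (assumed definition)."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the measured value.
    terms = [abs(m - p) / abs(m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical (measured, predicted) pairs for three cross-validation folds.
# These are made-up numbers for illustration only.
folds = [
    ([10.0, 20.0], [11.0, 18.0]),
    ([5.0, 8.0], [5.0, 10.0]),
    ([4.0, 2.0], [3.0, 2.0]),
]

fold_mapes = [mape(m, p) for m, p in folds]
median_mape = median(fold_mapes)
```

Because each error is divided by the measured value, kernels with very short execution times can dominate the metric, which may be one reason the reported range for time prediction is much wider than for power.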

Related papers

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach [0.8192907805418583]
This study employs two approaches: a custom-implemented tiled matrix multiplication kernel and NVIDIA's CUTLASS library. We developed a Random Forest-based prediction model with multi-output regression capability. Our framework achieved exceptional accuracy with an R² score of 0.98 for runtime prediction and 0.78 for power prediction.
arXiv Detail & Related papers (2024-11-25T21:47:23Z)
Benchmarking Edge Computing Devices for Grape Bunches and Trunks Detection using Accelerated Object Detection Single Shot MultiBox Deep Learning Models [2.1922186455344796]
This work benchmarks the performance of different platforms for real-time object detection. The authors used RetinaNet ResNet-50, fine-tuned on the natural Vine dataset.
arXiv Detail & Related papers (2022-11-21T17:02:33Z)
Tech Report: One-stage Lightweight Object Detectors [0.38073142980733]
This work designs one-stage lightweight detectors that perform well in terms of mAP and latency. Starting from baseline models that target GPU and CPU respectively, various operations are substituted for the main operations in the baseline models' backbone networks.
arXiv Detail & Related papers (2022-10-31T09:02:37Z)
MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware. Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters. We also demonstrate that, unlike MAPLE, which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
Building a Performance Model for Deep Learning Recommendation Model Training on GPUs [6.05245376098191]
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM). We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph.
arXiv Detail & Related papers (2022-01-19T19:05:42Z)
Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement. We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment. We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler. We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
Data-Efficient Instance Segmentation with a Single GPU [88.31338435907304]
We introduce the data-efficient segmentation method we used in the 2021 VIPriors Instance Challenge. Our solution is a modified version of Swin Transformer, built on mmdetection, a powerful toolbox. Our method achieved an AP@0.50:0.95 (medium) of 0.592, which ranks second among all contestants.
arXiv Detail & Related papers (2021-10-01T07:36:20Z)
Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size to match specific devices. Our framework can guarantee that the identified model meets both the resource and real-time specifications of mobile devices. Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach. We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next. On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.