A Simple Model for Portable and Fast Prediction of Execution Time and
Power Consumption of GPU Kernels
- URL: http://arxiv.org/abs/2001.07104v3
- Date: Wed, 30 Sep 2020 12:47:57 GMT
- Title: A Simple Model for Portable and Fast Prediction of Execution Time and
Power Consumption of GPU Kernels
- Authors: Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger
Fr\"oning
- Abstract summary: This model is built based on random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC.
Evaluation of the model performance using cross-validation yields a median Mean Average Percentage Error (MAPE) of 8.86-52.00% and 1.84-2.94%, for time respectively power prediction across five different GPUs, while latency for a single prediction varies between 15 and 108 milliseconds.
- Score: 2.9853894456071077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Characterizing compute kernel execution behavior on GPUs for efficient task
scheduling is a non-trivial task. We address this with a simple model enabling
portable and fast predictions among different GPUs using only
hardware-independent features. This model is built based on random forests
using 189 individual compute kernels from benchmarks such as Parboil, Rodinia,
Polybench-GPU and SHOC. Evaluation of the model performance using
cross-validation yields a median Mean Average Percentage Error (MAPE) of
8.86-52.00% and 1.84-2.94%, for time respectively power prediction across five
different GPUs, while latency for a single prediction varies between 15 and 108
milliseconds.
Related papers
- Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach [0.8192907805418583]
This study employs two approaches: a custom-implemented tiled matrix multiplication kernel and NVIDIA's CUTLASS library.
We developed a Random Forest-based prediction model with multi-output regression capability.
Our framework achieved exceptional accuracy with an R2 score of 0.98 for runtime prediction and 0.78 for power prediction.
arXiv Detail & Related papers (2024-11-25T21:47:23Z) - Benchmarking Edge Computing Devices for Grape Bunches and Trunks
Detection using Accelerated Object Detection Single Shot MultiBox Deep
Learning Models [2.1922186455344796]
This work benchmarks the performance of different platforms for object detection in real-time.
Authors used the RetinaNet ResNet-50 fine-tuned using the natural Vine dataset.
arXiv Detail & Related papers (2022-11-21T17:02:33Z) - Tech Report: One-stage Lightweight Object Detectors [0.38073142980733]
This work is for designing one-stage lightweight detectors which perform well in terms of mAP and latency.
With baseline models each of which targets on GPU and CPU respectively, various operations are applied instead of the main operations in backbone networks of baseline models.
arXiv Detail & Related papers (2022-10-31T09:02:37Z) - MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z) - Building a Performance Model for Deep Learning Recommendation Model
Training on GPUs [6.05245376098191]
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM)
We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time.
We propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph.
arXiv Detail & Related papers (2022-01-19T19:05:42Z) - Accelerating Training and Inference of Graph Neural Networks with Fast
Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Data-Efficient Instance Segmentation with a Single GPU [88.31338435907304]
We introduce a data-efficient segmentation method we used in the 2021 VIPriors Instance Challenge.
Our solution is a modified version of Swin Transformer, based on the mmdetection which is a powerful toolbox.
Our method achieved the AP@0.50:0.95 (medium) of 0.592, which ranks second among all contestants.
arXiv Detail & Related papers (2021-10-01T07:36:20Z) - Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video(EVS) pipeline that combines: (i) On the CPU, a very fast optical flow method, that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.