GEVO: GPU Code Optimization using Evolutionary Computation
- URL: http://arxiv.org/abs/2004.08140v2
- Date: Mon, 27 Apr 2020 21:30:52 GMT
- Title: GEVO: GPU Code Optimization using Evolutionary Computation
- Authors: Jhe-Yu Liou, Xiaodong Wang, Stephanie Forrest, Carole-Jean Wu
- Abstract summary: GEVO is a tool for discovering optimization opportunities and tuning the performance of GPU kernels in the LLVM representation.
GEVO improves execution time of the GPU programs in the Rodinia benchmark suite and the machine learning models, SVM and ResNet18, on NVIDIA Tesla P100.
GEVO achieves 1.79X kernel performance improvement on image classification using ResNet18/CIFAR-10, with less than 1% model accuracy reduction.
- Score: 12.9965710635562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPUs are a key enabler of the revolution in machine learning and high
performance computing, functioning as de facto co-processors to accelerate
large-scale computation. As the programming stack and tool support have
matured, GPUs have also become accessible to programmers, who may lack detailed
knowledge of the underlying architecture and fail to fully leverage the GPU's
computation power. GEVO (Gpu optimization using EVOlutionary computation) is a
tool for automatically discovering optimization opportunities and tuning the
performance of GPU kernels in the LLVM representation. GEVO uses
population-based search to find edits to GPU code compiled to LLVM-IR and
improves performance on desired criteria while retaining required
functionality. We demonstrate that GEVO improves the execution time of the GPU
programs in the Rodinia benchmark suite and the machine learning models, SVM
and ResNet18, on NVIDIA Tesla P100. For the Rodinia benchmarks, GEVO improves
GPU kernel runtime performance by an average of 49.48% and by as much as 412%
over the fully compiler-optimized baseline. If kernel output accuracy is
relaxed to tolerate up to 1% error, GEVO can find kernel variants that
outperform the baseline version by an average of 51.08%. For the machine
learning workloads, GEVO achieves kernel performance improvement for SVM on the
MNIST handwriting recognition (3.24X) and the a9a income prediction (2.93X)
datasets with no loss of model accuracy. GEVO achieves 1.79X kernel performance
improvement on image classification using ResNet18/CIFAR-10, with less than 1%
model accuracy reduction.
Related papers
- Multi-GPU RI-HF Energies and Analytic Gradients $-$ Towards High Throughput Ab Initio Molecular Dynamics [0.0]
This article presents an optimized algorithm and implementation for calculating resolution-of-the-identity Hartree-Fock energies and analytic gradients using multiple Graphics Processing Units (GPUs)
The algorithm is especially designed for high throughput emphab initio molecular dynamics simulations of small and medium size molecules (10-100 atoms)
arXiv Detail & Related papers (2024-07-29T00:14:10Z) - Forecasting GPU Performance for Deep Learning Training and Inference [10.741682409837612]
NeuSight is a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution.
NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU.
It reduces the percentage error from 198% and 19.7% to 3.8% in predicting the latency of GPT3 model for training and inference on H100, compared to state-of-the-art prior works.
arXiv Detail & Related papers (2024-07-18T18:47:52Z) - Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent [48.791943145735]
We show the potential to reduce Ansor's search time while enhancing kernel quality.
We apply this approach to the first 300 kernels that Ansor generates.
This result has been replicated in 20 well-known deep-learning models.
arXiv Detail & Related papers (2024-06-28T16:34:22Z) - Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models [0.44998333629984877]
We propose a methodology that uses deep sequence- to-sequence models to predict the optimal tuning parameters governing compute kernels.
The proposed algorithm can achieve more than 90% accuracy on various convolutional kernels in MIOpen, the AMD machine learning primitives library.
arXiv Detail & Related papers (2024-04-15T22:25:54Z) - SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation [0.0]
Large language models (LLMs) have become a significant workload since their appearance.
They are also computationally expensive as they have billions of parameters and are trained with massive amounts of data.
Recent works have developed dedicated kernels for LLM training and inference instead of relying on compilergenerated ones, so that hardware resources are as fully utilized as possible.
arXiv Detail & Related papers (2024-03-25T15:26:50Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training [18.52206409432894]
DistTGL is an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
In experiments, DistTGL achieves near-linear convergence speedup, outperforming state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
arXiv Detail & Related papers (2023-07-14T22:52:27Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for an LVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.