SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation
- URL: http://arxiv.org/abs/2403.16863v1
- Date: Mon, 25 Mar 2024 15:26:50 GMT
- Title: SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation
- Authors: Guoliang He, Eiko Yoneki
- Abstract summary: Large language models (LLMs) have become a significant workload since their appearance.
They are also computationally expensive as they have billions of parameters and are trained with massive amounts of data.
Recent works have developed dedicated kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive as they have billions of parameters and are trained with massive amounts of data. Thus, recent works have developed dedicated CUDA kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible. In this work, we explore the possibility of GPU native instruction optimization to further push the CUDA kernels to extreme performance. Contrary to prior works, we adopt an automatic optimization approach by defining a search space of possible GPU native instruction schedules, and then we apply stochastic search to perform optimization. Experiments show that SIP can further improve CUDA kernel throughput by automatically discovering better GPU native instruction schedules, and the optimized schedules are verified with 10 million test samples.
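The abstract outlines the core loop: perturb a candidate GPU native instruction schedule, benchmark it, and keep improvements. Below is a minimal, hypothetical Python sketch of such a stochastic search under stated assumptions; the schedule representation and the helpers `benchmark_kernel` and `is_valid` are illustrative placeholders, not the paper's actual implementation.

```python
import random
import copy

def mutate(schedule):
    """Swap two adjacent instructions to produce a perturbed schedule."""
    candidate = copy.deepcopy(schedule)
    i = random.randrange(len(candidate) - 1)
    candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
    return candidate

def stochastic_search(schedule, benchmark_kernel, is_valid,
                      steps=1000, temperature=0.05):
    """Return the fastest valid instruction schedule found.

    benchmark_kernel(schedule) -> measured runtime (lower is better)
    is_valid(schedule)         -> True if data dependencies are respected
    """
    best = current = schedule
    best_t = current_t = benchmark_kernel(schedule)
    for _ in range(steps):
        candidate = mutate(current)
        if not is_valid(candidate):   # skip schedules that break dependencies
            continue
        t = benchmark_kernel(candidate)
        # Accept improvements; occasionally accept regressions so the search
        # can escape local minima.
        if t < current_t or random.random() < temperature:
            current, current_t = candidate, t
            if t < best_t:
                best, best_t = candidate, t
    return best, best_t
```

In practice the mutation operators, acceptance rule, dependency checks, and correctness verification (e.g., against the test samples mentioned in the abstract) would be considerably more involved than this sketch.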
Related papers
- Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity of Gaussian process models and can be computed with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units (GPUs) is lacking.
We show that our new method outperforms the other two and then present it in the GpGpU R package.
arXiv Detail & Related papers (2024-07-03T01:24:44Z) - Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models [0.44998333629984877]
We propose a methodology that uses deep sequence-to-sequence models to predict the optimal tuning parameters governing compute kernels.
The proposed algorithm can achieve more than 90% accuracy on various convolutional kernels in MIOpen, the AMD machine learning primitives library.
arXiv Detail & Related papers (2024-04-15T22:25:54Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Searching CUDA code autotuning spaces with hardware performance
counters: data from benchmarks running on various GPU architectures [0.0]
We develop benchmarks that take into account performance-relevant source-code parameters and reach near peak-performance on various GPU architectures.
With our framework Kernel Tuning Toolkit, we measured times and hardware performance counters on several GPUs for the complete tuning spaces of five benchmarks.
We describe the scripts we used for robust evaluation of our searcher and comparison to others in detail.
arXiv Detail & Related papers (2021-02-10T07:51:09Z) - Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs.
A programmer specifies a projection of a dataflow computation onto a linear systolic array, while leaving the detailed implementation of the projection to the compiler.
The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
arXiv Detail & Related papers (2020-10-29T18:49:54Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL
primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z) - GEVO: GPU Code Optimization using Evolutionary Computation [12.9965710635562]
GEVO is a tool for discovering optimization opportunities and tuning the performance of GPU kernels in the LLVM representation.
GEVO improves the execution time of GPU programs in the Rodinia benchmark suite and of the machine learning models SVM and ResNet18 on an NVIDIA Tesla P100.
GEVO achieves 1.79X kernel performance improvement on image classification using ResNet18/CIFAR-10, with less than 1% model accuracy reduction.
arXiv Detail & Related papers (2020-04-17T09:36:17Z)