Optimal Kernel Orchestration for Tensor Programs with Korch
- URL: http://arxiv.org/abs/2406.09465v1
- Date: Thu, 13 Jun 2024 04:44:38 GMT
- Title: Optimal Kernel Orchestration for Tensor Programs with Korch
- Authors: Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, et al.
- Abstract summary: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms.
This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs.
- Score: 13.143585283794902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.
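As a rough illustration of the constrained-optimization view, the sketch below poses kernel selection as a binary linear program, with PuLP standing in for the off-the-shelf solver; the primitives, candidate kernels, and costs are toy assumptions, not Korch's actual encoding.

```python
# Hedged sketch: kernel orchestration as a binary (0/1) linear program.
# The primitives, candidate kernels, and costs below are illustrative.
import pulp

primitives = ["exp", "sum", "div", "matmul"]           # toy fission output
kernels = {                                            # kernel -> (covered primitives, cost)
    "k_softmax": ({"exp", "sum", "div"}, 5.0),         # fused softmax kernel
    "k_exp":     ({"exp"}, 2.0),
    "k_sum":     ({"sum"}, 2.0),
    "k_div":     ({"div"}, 2.0),
    "k_matmul":  ({"matmul"}, 10.0),
}

prob = pulp.LpProblem("kernel_orchestration", pulp.LpMinimize)
x = {k: pulp.LpVariable(k, cat="Binary") for k in kernels}   # 1 = launch kernel k

# Objective: total execution cost of the selected kernels.
prob += pulp.lpSum(cost * x[k] for k, (_, cost) in kernels.items())

# Coverage constraints: every primitive must be computed by at least one kernel.
for p in primitives:
    prob += pulp.lpSum(x[k] for k, (ops, _) in kernels.items() if p in ops) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({k: int(x[k].value()) for k in kernels})   # expect the fused softmax + matmul
```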
Related papers
- Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent [48.791943145735]
We show the potential to reduce Ansor's search time while enhancing kernel quality.
We apply this approach to the first 300 kernels that Ansor generates.
This result has been replicated in 20 well-known deep-learning models.
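A minimal sketch of coordinate descent over schedule knobs, in the spirit of this approach; the knob space and the synthetic cost function are hypothetical stand-ins for a real kernel whose latency would be measured on hardware.

```python
# Hedged sketch: coordinate descent over discrete schedule knobs.
knobs = {
    "tile_x": [4, 8, 16, 32],
    "tile_y": [4, 8, 16, 32],
    "unroll": [1, 2, 4],
}

def cost(cfg):
    # Toy latency model; a real tuner would compile and time the kernel.
    return abs(cfg["tile_x"] - 16) + abs(cfg["tile_y"] - 8) + 2 * abs(cfg["unroll"] - 2)

# Start from a seed configuration (e.g., the best schedule found so far).
cfg = {"tile_x": 4, "tile_y": 32, "unroll": 1}
improved = True
while improved:
    improved = False
    for name, values in knobs.items():               # optimize one coordinate at a time
        best_v = min(values, key=lambda v: cost({**cfg, name: v}))
        if cost({**cfg, name: best_v}) < cost(cfg):
            cfg[name] = best_v
            improved = True

print(cfg, cost(cfg))   # converges to tile_x=16, tile_y=8, unroll=2
```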
arXiv Detail & Related papers (2024-06-28T16:34:22Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
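A minimal sketch of that two-step structure, assuming a toy block-GEMM routine in place of a real TPP; only the separation between the computational core and the logical loops around it is illustrated.

```python
# Hedged sketch: a small "processing primitive" (one dense tile update)
# wrapped by simple outer loops over blocks. Names and block sizes are
# illustrative, not the framework's API.
import numpy as np

BM, BN, BK = 32, 32, 32   # tile sizes handled by the primitive

def gemm_primitive(a_blk, b_blk, c_blk):
    # Computational core: the part a TPP-like primitive (or a hand-tuned
    # microkernel) would implement.
    c_blk += a_blk @ b_blk

def blocked_matmul(A, B):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # "Logical loops around the primitive": only blocking and iteration order
    # live here; all arithmetic lives in the primitive.
    for i in range(0, M, BM):
        for j in range(0, N, BN):
            for k in range(0, K, BK):
                gemm_primitive(A[i:i+BM, k:k+BK], B[k:k+BK, j:j+BN], C[i:i+BM, j:j+BN])
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)
```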
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - ALT: Breaking the Wall between Graph and Operator Level Optimizations for Deep Learning Compilation [38.8918502461244]
ALT is a compiler that performs joint graph- and operator-level optimizations for deep models.
ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single operator performance and end-to-end inference performance.
arXiv Detail & Related papers (2022-10-22T11:09:36Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
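A minimal sketch of the random-feature idea, using random Fourier features for an RBF kernel rather than the paper's NNGP kernel; the mechanism (replace the n-by-n kernel matrix with explicit finite-dimensional features) is the same.

```python
# Hedged sketch: random Fourier features approximating an RBF kernel.
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma = 200, 10, 4096, 1.0
X = rng.normal(size=(n, d))

# Exact RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / (2 * sigma**2))

# Random features: phi(x) = sqrt(2/D) * cos(W^T x + b) with W ~ N(0, sigma^-2 I).
W = rng.normal(scale=1.0 / sigma, size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
K_approx = Phi @ Phi.T          # O(nD) features instead of an O(n^2) kernel matrix

print(np.abs(K_exact - K_approx).max())   # small approximation error
```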
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - OLLIE: Derivation-based Tensor Program Optimizer [13.23204410403652]
We propose OLLIE, the first derivation-based tensor program optimizer.
We show that OLLIE can outperform existing tensor program optimizers by up to 2.73x (1.46x on average) on an A100 GPU and by up to 2.68x on a V100 GPU.
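A minimal sketch of why derivations pay off, assuming nothing about OLLIE's actual rewrite rules: two mathematically equivalent expressions can differ by orders of magnitude in cost, and such equivalent forms are exactly the search space a derivation-based optimizer explores.

```python
# Hedged sketch: equivalent tensor expressions with very different costs.
import time
import numpy as np

n = 2000
A, B = np.random.rand(n, n), np.random.rand(n, n)
v = np.random.rand(n)

t0 = time.perf_counter(); y1 = (A @ B) @ v; t1 = time.perf_counter()   # O(n^3)
t2 = time.perf_counter(); y2 = A @ (B @ v); t3 = time.perf_counter()   # O(n^2)

print(np.allclose(y1, y2), f"{(t1 - t0) / (t3 - t2):.0f}x slower")
```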
arXiv Detail & Related papers (2022-08-02T14:38:58Z) - A Unified Framework for Implicit Sinkhorn Differentiation [58.56866763433335]
We propose an algorithm that obtains analytical gradients of a Sinkhorn layer via implicit differentiation.
We show that it is computationally more efficient, particularly when resources like GPU memory are scarce.
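A minimal sketch of the forward Sinkhorn iteration the layer computes; the paper's contribution, analytic gradients via implicit differentiation of this fixed point, is not reproduced here.

```python
# Hedged sketch: standard Sinkhorn matrix scaling for entropic optimal transport.
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 50, 60, 0.1
a = np.full(n, 1.0 / n)                  # source marginal
b = np.full(m, 1.0 / m)                  # target marginal
C = rng.random((n, m))                   # cost matrix
K = np.exp(-C / eps)                     # Gibbs kernel

u, v = np.ones(n), np.ones(m)
for _ in range(500):                     # alternate scaling until convergence
    u = a / (K @ v)
    v = b / (K.T @ u)

P = u[:, None] * K * v[None, :]          # transport plan
print(P.sum(0)[:3], P.sum(1)[:3])        # columns/rows match the marginals
```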
arXiv Detail & Related papers (2022-05-13T14:45:31Z) - The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding [14.635810503599759]
CoRa is a tensor compiler that allows users to easily generate efficient code for ragged tensor operators.
We evaluate CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model.
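A minimal sketch of the ragged layout CoRa targets: variable-length rows kept in one flat buffer with offsets, so per-row operators run without padding. CoRa's code generation itself is not shown.

```python
# Hedged sketch: ragged storage with offsets instead of padded tensors.
import numpy as np

lengths = np.array([3, 7, 2, 5])                   # e.g., sequence lengths in a batch
offsets = np.concatenate(([0], np.cumsum(lengths)))
flat = np.random.rand(offsets[-1], 8)              # all tokens in one buffer, no padding

def row_softmax_scores(q):
    # Apply an operator independently to each ragged row (toy attention scores).
    out = []
    for i in range(len(lengths)):
        row = flat[offsets[i]:offsets[i + 1]]      # view of one variable-length row
        s = row @ q
        e = np.exp(s - s.max())
        out.append(e / e.sum())
    return out

print([len(r) for r in row_softmax_scores(np.random.rand(8))])   # [3, 7, 2, 5]
```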
arXiv Detail & Related papers (2021-10-19T19:39:04Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
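A minimal sketch of the ingredients the abstract lists (random projections plus numerical linear algebra), here as a tiny Nystrom kernel ridge regression; the paper's actual solver is far more elaborate and GPU-oriented.

```python
# Hedged sketch: Nystrom (random subset) kernel ridge regression.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam, sigma = 5000, 10, 200, 1e-3, 1.0
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf(A, B):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

idx = rng.choice(n, size=m, replace=False)        # m randomly chosen "inducing" points
Knm = rbf(X, X[idx])                              # (n, m) cross-kernel block
Kmm = rbf(X[idx], X[idx])                         # (m, m) block
# Nystrom-regularized normal equations: (Knm^T Knm + lam*n*Kmm) alpha = Knm^T y
alpha = np.linalg.solve(Knm.T @ Knm + lam * n * Kmm, Knm.T @ y)
pred = Knm @ alpha
print(np.mean((pred - y) ** 2))                   # training MSE of the sketch
```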
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Kernel Operations on the GPU, with Autodiff, without Memory Overflows [5.669790037378094]
The KeOps library provides fast and memory-efficient GPU support for tensors whose entries are given by a mathematical formula.
KeOps alleviates the major bottleneck of tensor-centric libraries for kernel and geometric applications: memory consumption.
KeOps combines optimized C++/CUDA schemes with binders for high-level languages: Python (Numpy and PyTorch), Matlab and R.
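A short example in the style of KeOps' canonical Gaussian-kernel reduction, assuming pykeops and PyTorch are installed; the point is that the (M, N) kernel matrix is never materialized.

```python
# Sketch of the symbolic-tensor style KeOps exposes: the (M, N) kernel matrix
# below exists only as a formula, and the reduction is computed on the fly.
import torch
from pykeops.torch import LazyTensor

x = torch.randn(100000, 3)                 # M points
y = torch.randn(200000, 3)                 # N points

x_i = LazyTensor(x[:, None, :])            # symbolic (M, 1, 3)
y_j = LazyTensor(y[None, :, :])            # symbolic (1, N, 3)
D_ij = ((x_i - y_j) ** 2).sum(-1)          # symbolic (M, N) squared distances
K_ij = (-D_ij).exp()                       # symbolic Gaussian kernel matrix

a = K_ij.sum(dim=1)                        # (M, 1) result without an (M, N) buffer
print(a.shape)
```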
arXiv Detail & Related papers (2020-03-27T08:54:10Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to building deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
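A minimal sketch of the hybrid idea, assuming a fixed toy microkernel: the outer loops (here, tile sizes searched by brute force) are the part a polyhedral tool would tune analytically rather than empirically.

```python
# Hedged sketch: fixed microkernel, tuned outer loops.
import itertools
import time
import numpy as np

def microkernel(a, b, c):
    c += a @ b                              # stands in for a hand-tuned inner kernel

def run(A, B, bm, bn, bk):
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N))
    for i in range(0, M, bm):               # outer loops: the tunable part
        for j in range(0, N, bn):
            for k in range(0, K, bk):
                microkernel(A[i:i+bm, k:k+bk], B[k:k+bk, j:j+bn], C[i:i+bm, j:j+bn])
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
best = None
for bm, bn, bk in itertools.product([32, 64, 128], repeat=3):
    t0 = time.perf_counter(); run(A, B, bm, bn, bk); dt = time.perf_counter() - t0
    best = min(best, (dt, (bm, bn, bk))) if best else (dt, (bm, bn, bk))
print("best tiling:", best[1])
```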
arXiv Detail & Related papers (2020-02-06T08:02:34Z)