Optimal Kernel Orchestration for Tensor Programs with Korch
- URL: http://arxiv.org/abs/2406.09465v1
- Date: Thu, 13 Jun 2024 04:44:38 GMT
- Title: Optimal Kernel Orchestration for Tensor Programs with Korch
- Authors: Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, et al.
- Abstract summary: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms.
This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs.
- Score: 13.143585283794902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.
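As a rough illustration of the constrained-optimization view, the sketch below poses kernel selection as a binary linear program, with PuLP standing in for the off-the-shelf solver; the primitives, candidate kernels, and costs are toy assumptions, not Korch's actual encoding.

```python
# Hedged sketch: kernel orchestration as a binary (0/1) linear program.
# The primitives, candidate kernels, and costs below are illustrative.
import pulp

primitives = ["exp", "sum", "div", "matmul"]           # toy fission output
kernels = {                                            # kernel -> (covered primitives, cost)
    "k_softmax": ({"exp", "sum", "div"}, 5.0),         # fused softmax kernel
    "k_exp":     ({"exp"}, 2.0),
    "k_sum":     ({"sum"}, 2.0),
    "k_div":     ({"div"}, 2.0),
    "k_matmul":  ({"matmul"}, 10.0),
}

prob = pulp.LpProblem("kernel_orchestration", pulp.LpMinimize)
x = {k: pulp.LpVariable(k, cat="Binary") for k in kernels}   # 1 = launch kernel k

# Objective: total execution cost of the selected kernels.
prob += pulp.lpSum(cost * x[k] for k, (_, cost) in kernels.items())

# Coverage constraints: every primitive must be computed by at least one kernel.
for p in primitives:
    prob += pulp.lpSum(x[k] for k, (ops, _) in kernels.items() if p in ops) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({k: int(x[k].value()) for k in kernels})   # expect the fused softmax + matmul
```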
Related papers
- Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent [48.791943145735]
We show the potential to reduce Ansor's search time while enhancing kernel quality.
We apply this approach to the first 300 kernels that Ansor generates.
This result has been replicated in 20 well-known deep-learning models.
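A minimal sketch of coordinate descent over schedule knobs, in the spirit of this approach; the knob space and the synthetic cost function are hypothetical stand-ins for a real kernel whose latency would be measured on hardware.

```python
# Hedged sketch: coordinate descent over discrete schedule knobs.
knobs = {
    "tile_x": [4, 8, 16, 32],
    "tile_y": [4, 8, 16, 32],
    "unroll": [1, 2, 4],
}

def cost(cfg):
    # Toy latency model; a real tuner would compile and time the kernel.
    return abs(cfg["tile_x"] - 16) + abs(cfg["tile_y"] - 8) + 2 * abs(cfg["unroll"] - 2)

# Start from a seed configuration (e.g., the best schedule found so far).
cfg = {"tile_x": 4, "tile_y": 32, "unroll": 1}
improved = True
while improved:
    improved = False
    for name, values in knobs.items():               # optimize one coordinate at a time
        best_v = min(values, key=lambda v: cost({**cfg, name: v}))
        if cost({**cfg, name: best_v}) < cost(cfg):
            cfg[name] = best_v
            improved = True

print(cfg, cost(cfg))   # converges to tile_x=16, tile_y=8, unroll=2
```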
arXiv Detail & Related papers (2024-06-28T16:34:22Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
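A minimal sketch of that two-step structure, assuming a toy block-GEMM routine in place of a real TPP; only the separation between the computational core and the logical loops around it is illustrated.

```python
# Hedged sketch: a small "processing primitive" (one dense tile update)
# wrapped by simple outer loops over blocks. Names and block sizes are
# illustrative, not the framework's API.
import numpy as np

BM, BN, BK = 32, 32, 32   # tile sizes handled by the primitive

def gemm_primitive(a_blk, b_blk, c_blk):
    # Computational core: the part a TPP-like primitive (or a hand-tuned
    # microkernel) would implement.
    c_blk += a_blk @ b_blk

def blocked_matmul(A, B):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # "Logical loops around the primitive": only blocking and iteration order
    # live here; all arithmetic lives in the primitive.
    for i in range(0, M, BM):
        for j in range(0, N, BN):
            for k in range(0, K, BK):
                gemm_primitive(A[i:i+BM, k:k+BK], B[k:k+BK, j:j+BN], C[i:i+BM, j:j+BN])
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)
```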
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - ALT: Breaking the Wall between Graph and Operator Level Optimizations for Deep Learning Compilation [38.8918502461244]
ALT is a compiler that performs joint graph- and operator-level optimizations for deep models.
ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single operator performance and end-to-end inference performance.
arXiv Detail & Related papers (2022-10-22T11:09:36Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
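A minimal sketch of the random-feature idea, using random Fourier features for an RBF kernel rather than the paper's NNGP kernel; the mechanism (replace the n-by-n kernel matrix with explicit finite-dimensional features) is the same.

```python
# Hedged sketch: random Fourier features approximating an RBF kernel.
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma = 200, 10, 4096, 1.0
X = rng.normal(size=(n, d))

# Exact RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / (2 * sigma**2))

# Random features: phi(x) = sqrt(2/D) * cos(W^T x + b) with W ~ N(0, sigma^-2 I).
W = rng.normal(scale=1.0 / sigma, size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
K_approx = Phi @ Phi.T          # O(nD) features instead of an O(n^2) kernel matrix

print(np.abs(K_exact - K_approx).max())   # small approximation error
```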
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - OLLIE: Derivation-based Tensor Program Optimizer [13.23204410403652]
We propose OLLIE, the first derivation-based tensor program optimizer.
We show that OLLIE can outperform existing tensor program optimizers by up to 2.73x (1.46x on average) on an A100 GPU and by up to 2.68x on a V100 GPU.
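A minimal sketch of why derivations pay off, assuming nothing about OLLIE's actual rewrite rules: two mathematically equivalent expressions can differ by orders of magnitude in cost, and such equivalent forms are exactly the search space a derivation-based optimizer explores.

```python
# Hedged sketch: equivalent tensor expressions with very different costs.
import time
import numpy as np

n = 2000
A, B = np.random.rand(n, n), np.random.rand(n, n)
v = np.random.rand(n)

t0 = time.perf_counter(); y1 = (A @ B) @ v; t1 = time.perf_counter()   # O(n^3)
t2 = time.perf_counter(); y2 = A @ (B @ v); t3 = time.perf_counter()   # O(n^2)

print(np.allclose(y1, y2), f"{(t1 - t0) / (t3 - t2):.0f}x slower")
```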
arXiv Detail & Related papers (2022-08-02T14:38:58Z) - A Unified Framework for Implicit Sinkhorn Differentiation [58.56866763433335]
We propose an algorithm that obtains analytical gradients of a Sinkhorn layer via implicit differentiation.
We show that it is computationally more efficient, particularly when resources like GPU memory are scarce.
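A minimal sketch of the forward Sinkhorn iteration the layer computes; the paper's contribution, analytic gradients via implicit differentiation of this fixed point, is not reproduced here.

```python
# Hedged sketch: standard Sinkhorn matrix scaling for entropic optimal transport.
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 50, 60, 0.1
a = np.full(n, 1.0 / n)                  # source marginal
b = np.full(m, 1.0 / m)                  # target marginal
C = rng.random((n, m))                   # cost matrix
K = np.exp(-C / eps)                     # Gibbs kernel

u, v = np.ones(n), np.ones(m)
for _ in range(500):                     # alternate scaling until convergence
    u = a / (K @ v)
    v = b / (K.T @ u)

P = u[:, None] * K * v[None, :]          # transport plan
print(P.sum(0)[:3], P.sum(1)[:3])        # columns/rows match the marginals
```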
arXiv Detail & Related papers (2022-05-13T14:45:31Z) - The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding [14.635810503599759]
CoRa is a tensor compiler that allows users to easily generate efficient code for ragged tensor operators.
We evaluate CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model.
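A minimal sketch of the ragged layout CoRa targets: variable-length rows kept in one flat buffer with offsets, so per-row operators run without padding. CoRa's code generation itself is not shown.

```python
# Hedged sketch: ragged storage with offsets instead of padded tensors.
import numpy as np

lengths = np.array([3, 7, 2, 5])                   # e.g., sequence lengths in a batch
offsets = np.concatenate(([0], np.cumsum(lengths)))
flat = np.random.rand(offsets[-1], 8)              # all tokens in one buffer, no padding

def row_softmax_scores(q):
    # Apply an operator independently to each ragged row (toy attention scores).
    out = []
    for i in range(len(lengths)):
        row = flat[offsets[i]:offsets[i + 1]]      # view of one variable-length row
        s = row @ q
        e = np.exp(s - s.max())
        out.append(e / e.sum())
    return out

print([len(r) for r in row_softmax_scores(np.random.rand(8))])   # [3, 7, 2, 5]
```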
arXiv Detail & Related papers (2021-10-19T19:39:04Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
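A minimal sketch of the ingredients the abstract lists (random projections plus numerical linear algebra), here as a tiny Nystrom kernel ridge regression; the paper's actual solver is far more elaborate and GPU-oriented.

```python
# Hedged sketch: Nystrom (random subset) kernel ridge regression.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam, sigma = 5000, 10, 200, 1e-3, 1.0
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf(A, B):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

idx = rng.choice(n, size=m, replace=False)        # m randomly chosen "inducing" points
Knm = rbf(X, X[idx])                              # (n, m) cross-kernel block
Kmm = rbf(X[idx], X[idx])                         # (m, m) block
# Nystrom-regularized normal equations: (Knm^T Knm + lam*n*Kmm) alpha = Knm^T y
alpha = np.linalg.solve(Knm.T @ Knm + lam * n * Kmm, Knm.T @ y)
pred = Knm @ alpha
print(np.mean((pred - y) ** 2))                   # training MSE of the sketch
```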
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Kernel Operations on the GPU, with Autodiff, without Memory Overflows [5.669790037378094]
The KeOps library provides fast and memory-efficient GPU support for tensors whose entries are given by a mathematical formula.
KeOps alleviates the major bottleneck of tensor-centric libraries for kernel and geometric applications: memory consumption.
KeOps combines optimized C++/CUDA schemes with binders for high-level languages: Python (Numpy and PyTorch), Matlab and R.
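A short example in the style of KeOps' canonical Gaussian-kernel reduction, assuming pykeops and PyTorch are installed; the point is that the (M, N) kernel matrix is never materialized.

```python
# Sketch of the symbolic-tensor style KeOps exposes: the (M, N) kernel matrix
# below exists only as a formula, and the reduction is computed on the fly.
import torch
from pykeops.torch import LazyTensor

x = torch.randn(100000, 3)                 # M points
y = torch.randn(200000, 3)                 # N points

x_i = LazyTensor(x[:, None, :])            # symbolic (M, 1, 3)
y_j = LazyTensor(y[None, :, :])            # symbolic (1, N, 3)
D_ij = ((x_i - y_j) ** 2).sum(-1)          # symbolic (M, N) squared distances
K_ij = (-D_ij).exp()                       # symbolic Gaussian kernel matrix

a = K_ij.sum(dim=1)                        # (M, 1) result without an (M, N) buffer
print(a.shape)
```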
arXiv Detail & Related papers (2020-03-27T08:54:10Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to building deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
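A minimal sketch of the hybrid idea, assuming a fixed toy microkernel: the outer loops (here, tile sizes searched by brute force) are the part a polyhedral tool would tune analytically rather than empirically.

```python
# Hedged sketch: fixed microkernel, tuned outer loops.
import itertools
import time
import numpy as np

def microkernel(a, b, c):
    c += a @ b                              # stands in for a hand-tuned inner kernel

def run(A, B, bm, bn, bk):
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N))
    for i in range(0, M, bm):               # outer loops: the tunable part
        for j in range(0, N, bn):
            for k in range(0, K, bk):
                microkernel(A[i:i+bm, k:k+bk], B[k:k+bk, j:j+bn], C[i:i+bm, j:j+bn])
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
best = None
for bm, bn, bk in itertools.product([32, 64, 128], repeat=3):
    t0 = time.perf_counter(); run(A, B, bm, bn, bk); dt = time.perf_counter() - t0
    best = min(best, (dt, (bm, bn, bk))) if best else (dt, (bm, bn, bk))
print("best tiling:", best[1])
```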
arXiv Detail & Related papers (2020-02-06T08:02:34Z)