LoopTune: Optimizing Tensor Computations with Reinforcement Learning
- URL: http://arxiv.org/abs/2309.01825v3
- Date: Wed, 8 Nov 2023 16:44:32 GMT
- Title: LoopTune: Optimizing Tensor Computations with Reinforcement Learning
- Authors: Dejan Grubisic, Bram Wasti, Chris Cummins, John Mellor-Crummey,
Aleksandar Zlateski
- Abstract summary: LoopTune is a compiler that optimizes tensor computations in deep learning models for the CPU.
With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM.
- Score: 43.82827359317833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced compiler technology is crucial for enabling machine learning
applications to run on novel hardware, but traditional compilers fail to
deliver performance, popular auto-tuners have long search times and
expert-optimized libraries introduce unsustainable costs. To address this, we
developed LoopTune, a deep reinforcement learning compiler that optimizes
tensor computations in deep learning models for the CPU. LoopTune optimizes
tensor traversal order while using the ultra-fast lightweight code generator
LoopNest to perform hardware-specific optimizations. With a novel graph-based
representation and action space, LoopTune speeds up LoopNest by 3.2x,
generating an order of magnitude faster code than TVM, 2.8x faster than
MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the
level of the hand-tuned library NumPy. Moreover, LoopTune tunes code in a matter of seconds.
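To make the search concrete, the sketch below models the kind of decision LoopTune's agent faces: choosing a traversal order for a matmul loop nest, where each action swaps two adjacent loops. This is only a toy illustration under assumed simplifications, not LoopTune's actual state representation, action space, or reward; EXTENTS, STRIDE, cost, step, and tune are made-up names, and a hand-written stride-based cost stands in for the feedback the real system presumably obtains by running LoopNest-generated code.

```python
# Toy sketch only: LoopTune's real graph-based state, action space, and reward
# are defined in the paper; everything below is an assumed simplification.
import random

# Loop nest: C[i, j] += A[i, k] * B[k, j], with row-major A[i][k], B[k][j], C[i][j].
EXTENTS = {"i": 256, "j": 256, "k": 256}

# Element stride seen by each array when the given loop variable is innermost.
STRIDE = {
    "i": {"A": EXTENTS["k"], "B": 0, "C": EXTENTS["j"]},
    "j": {"A": 0, "B": 1, "C": 1},
    "k": {"A": 1, "B": EXTENTS["j"], "C": 0},
}

def cost(order):
    """Hand-written proxy reward: penalize large strides in the innermost loop."""
    return sum(STRIDE[order[-1]].values())

def step(order, action):
    """Action = index of the adjacent loop pair to swap in the traversal order."""
    new = list(order)
    new[action], new[action + 1] = new[action + 1], new[action]
    return tuple(new)

def tune(episodes=20, steps=5, seed=0):
    """Random-search stand-in for the learned policy: try actions, keep the best order."""
    rng = random.Random(seed)
    best = ("i", "j", "k")
    for _ in range(episodes):
        order = tuple(rng.sample(list(EXTENTS), 3))
        for _ in range(steps):
            order = step(order, rng.randrange(2))
            if cost(order) < cost(best):
                best = order
    return best

if __name__ == "__main__":
    best = tune()
    print("best traversal order:", best, "proxy cost:", cost(best))
    # Orders with "j" innermost win here: unit stride for both B and C.
```
With this toy cost model, any order that keeps j innermost is optimal, mirroring the familiar intuition that unit-stride innermost traversal is cache-friendly.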
Related papers
- COGNAC: Circuit Optimization via Gradients and Noise-Aware Compilation [0.29998889086656577]
We present COGNAC, a novel strategy for compiling quantum circuits.
We use a simple noise model informed by the duration of entangling gates.
We reduce a circuit's gate count without the need for a large number of explicit elimination rewrite rules.
arXiv Detail & Related papers (2023-11-05T20:59:27Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion (a toy sketch of this two-step structure follows this entry).
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
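As a rough picture of that two-step decomposition, the sketch below wraps a fixed micro-kernel (standing in for a Tensor Processing Primitive) in outer loops that are specified as data rather than hand-written control flow. It is not the paper's TPP API or loop DSL, which target CPU back ends in C/C++; TILE, brgemm_tile, loop_spec, and run are illustrative names, and NumPy does the arithmetic.

```python
# Illustrative only: mirrors the "primitive core + declarative outer loops"
# structure, not the paper's actual TPP or DSL interfaces.
import numpy as np

TILE = 32  # assumed micro-kernel tile size

def brgemm_tile(c_tile, a_tile, b_tile):
    """Fixed computational core (TPP stand-in): C_tile += A_tile @ B_tile."""
    c_tile += a_tile @ b_tile

def loop_spec(m, n, k, tile):
    """Declarative outer-loop specification: the tile-index tuples to visit, in order."""
    return [(i, j, p)
            for i in range(0, m, tile)
            for j in range(0, n, tile)
            for p in range(0, k, tile)]

def run(a, b):
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    # Outer loops (order and tiling can be changed freely) around the fixed primitive.
    for i, j, p in loop_spec(m, n, k, TILE):
        brgemm_tile(c[i:i + TILE, j:j + TILE],
                    a[i:i + TILE, p:p + TILE],
                    b[p:p + TILE, j:j + TILE])
    return c

a = np.random.rand(128, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(run(a, b), a @ b, atol=1e-3)
```
The intent, as the summary describes it, is that the hardware-specific work lives in the primitive while the surrounding loops stay high-level and easy to re-tune.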
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler, Hidet (a toy illustration of the task-mapping idea follows this entry).
arXiv Detail & Related papers (2022-10-18T05:32:13Z)
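A task mapping can be pictured as a rule that decides which worker executes which pieces of a computation, and in what order. The sketch below is only a standalone analogue of that idea; it does not use Hidet's IR or APIs, and WORKERS, GRID, task_grid, and row_major_mapping are made-up names.

```python
# Illustrative analogue only: Hidet's task mappings live inside its tensor-program
# IR; this sketch just shows a mapping from workers to ordered lists of tasks.
WORKERS = 4    # e.g., threads or thread blocks
GRID = (4, 4)  # 4x4 grid of tile-level tasks for some kernel

def task_grid(grid):
    """Enumerate tile-level tasks of a 2D kernel in row-major order."""
    return [(i, j) for i in range(grid[0]) for j in range(grid[1])]

def row_major_mapping(tasks, workers):
    """Task mapping: assign task t to worker t % workers; each worker keeps row-major order."""
    assignment = {w: [] for w in range(workers)}
    for t, tile in enumerate(tasks):
        assignment[t % workers].append(tile)
    return assignment

mapping = row_major_mapping(task_grid(GRID), WORKERS)
for worker, tiles in mapping.items():
    print(f"worker {worker} executes tiles {tiles}")
```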
- LoopStack: a Lightweight Tensor Algebra Compiler Stack [61.04098601022665]
LoopStack is a domain-specific compiler stack for tensor operations.
It generates machine code that matches and frequently exceeds the performance of state-of-the-art machine learning frameworks.
It has a very small memory footprint: a binary size of 245KB and under 30K lines of effective code make it ideal for use on mobile and embedded devices.
arXiv Detail & Related papers (2022-05-02T01:57:58Z)
- Learning to Make Compiler Optimizations More Effective [11.125012960514471]
LoopLearner predicts which way of writing a loop will lead to efficient compiled code.
We evaluate LoopLearner with 1,895 loops from various performance-relevant benchmarks.
arXiv Detail & Related papers (2021-02-24T10:42:56Z)
- Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations [15.659251804042748]
Woodpecker-DL (WPK) is a hardware-aware deep learning framework.
WPK uses graph optimization, automated searches, a domain-specific language (DSL) and system-level exploration to accelerate inference.
We show that on a P100 GPU, we can achieve speedups of 5.40x over cuDNN and 1.63x over TVM on individual operators, and run up to 1.18x faster than TensorRT for end-to-end model inference.
arXiv Detail & Related papers (2020-08-11T07:50:34Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)