LoopTune: Optimizing Tensor Computations with Reinforcement Learning
- URL: http://arxiv.org/abs/2309.01825v3
- Date: Wed, 8 Nov 2023 16:44:32 GMT
- Title: LoopTune: Optimizing Tensor Computations with Reinforcement Learning
- Authors: Dejan Grubisic, Bram Wasti, Chris Cummins, John Mellor-Crummey,
Aleksandar Zlateski
- Abstract summary: LoopTune is a compiler that optimizes tensor computations in deep learning models for the CPU.
With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM.
- Score: 43.82827359317833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced compiler technology is crucial for enabling machine learning
applications to run on novel hardware, but traditional compilers fail to
deliver performance, popular auto-tuners have long search times and
expert-optimized libraries introduce unsustainable costs. To address this, we
developed LoopTune, a deep reinforcement learning compiler that optimizes
tensor computations in deep learning models for the CPU. LoopTune optimizes
tensor traversal order while using the ultra-fast lightweight code generator
LoopNest to perform hardware-specific optimizations. With a novel graph-based
representation and action space, LoopTune speeds up LoopNest by 3.2x,
generating an order of magnitude faster code than TVM, 2.8x faster than
MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the
level of the hand-tuned library NumPy. Moreover, LoopTune tunes code in a matter of seconds.
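To make the search concrete, the sketch below models the kind of decision LoopTune's agent faces: choosing a traversal order for a matmul loop nest, where each action swaps two adjacent loops. This is only a toy illustration under assumed simplifications, not LoopTune's actual state representation, action space, or reward; EXTENTS, STRIDE, cost, step, and tune are made-up names, and a hand-written stride-based cost stands in for the feedback the real system presumably obtains by running LoopNest-generated code.

```python
# Toy sketch only: LoopTune's real graph-based state, action space, and reward
# are defined in the paper; everything below is an assumed simplification.
import random

# Loop nest: C[i, j] += A[i, k] * B[k, j], with row-major A[i][k], B[k][j], C[i][j].
EXTENTS = {"i": 256, "j": 256, "k": 256}

# Element stride seen by each array when the given loop variable is innermost.
STRIDE = {
    "i": {"A": EXTENTS["k"], "B": 0, "C": EXTENTS["j"]},
    "j": {"A": 0, "B": 1, "C": 1},
    "k": {"A": 1, "B": EXTENTS["j"], "C": 0},
}

def cost(order):
    """Hand-written proxy reward: penalize large strides in the innermost loop."""
    return sum(STRIDE[order[-1]].values())

def step(order, action):
    """Action = index of the adjacent loop pair to swap in the traversal order."""
    new = list(order)
    new[action], new[action + 1] = new[action + 1], new[action]
    return tuple(new)

def tune(episodes=20, steps=5, seed=0):
    """Random-search stand-in for the learned policy: try actions, keep the best order."""
    rng = random.Random(seed)
    best = ("i", "j", "k")
    for _ in range(episodes):
        order = tuple(rng.sample(list(EXTENTS), 3))
        for _ in range(steps):
            order = step(order, rng.randrange(2))
            if cost(order) < cost(best):
                best = order
    return best

if __name__ == "__main__":
    best = tune()
    print("best traversal order:", best, "proxy cost:", cost(best))
    # Orders with "j" innermost win here: unit stride for both B and C.
```
With this toy cost model, any order that keeps j innermost is optimal, mirroring the familiar intuition that unit-stride innermost traversal is cache-friendly.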
Related papers
- COGNAC: Circuit Optimization via Gradients and Noise-Aware Compilation [0.29998889086656577]
We present COGNAC, a novel strategy for compiling quantum circuits.
We use a simple noise model informed by the duration of entangling gates.
We reduce a circuit's gate count without the need for a large number of explicit elimination rewrite rules.
arXiv Detail & Related papers (2023-11-05T20:59:27Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development into two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion (a toy sketch of this two-step structure follows this entry).
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
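As a rough picture of that two-step decomposition, the sketch below wraps a fixed micro-kernel (standing in for a Tensor Processing Primitive) in outer loops that are specified as data rather than hand-written control flow. It is not the paper's TPP API or loop DSL, which target CPU back ends in C/C++; TILE, brgemm_tile, loop_spec, and run are illustrative names, and NumPy does the arithmetic.

```python
# Illustrative only: mirrors the "primitive core + declarative outer loops"
# structure, not the paper's actual TPP or DSL interfaces.
import numpy as np

TILE = 32  # assumed micro-kernel tile size

def brgemm_tile(c_tile, a_tile, b_tile):
    """Fixed computational core (TPP stand-in): C_tile += A_tile @ B_tile."""
    c_tile += a_tile @ b_tile

def loop_spec(m, n, k, tile):
    """Declarative outer-loop specification: the tile-index tuples to visit, in order."""
    return [(i, j, p)
            for i in range(0, m, tile)
            for j in range(0, n, tile)
            for p in range(0, k, tile)]

def run(a, b):
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    # Outer loops (order and tiling can be changed freely) around the fixed primitive.
    for i, j, p in loop_spec(m, n, k, TILE):
        brgemm_tile(c[i:i + TILE, j:j + TILE],
                    a[i:i + TILE, p:p + TILE],
                    b[p:p + TILE, j:j + TILE])
    return c

a = np.random.rand(128, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(run(a, b), a @ b, atol=1e-3)
```
The intent, as the summary describes it, is that the hardware-specific work lives in the primitive while the surrounding loops stay high-level and easy to re-tune.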
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs [11.338285393619042]
We propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering.
With the proposed paradigm, we implement a deep learning compiler, Hidet (a toy illustration of the task-mapping idea follows this entry).
arXiv Detail & Related papers (2022-10-18T05:32:13Z)
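A task mapping can be pictured as a rule that decides which worker executes which pieces of a computation, and in what order. The sketch below is only a standalone analogue of that idea; it does not use Hidet's IR or APIs, and WORKERS, GRID, task_grid, and row_major_mapping are made-up names.

```python
# Illustrative analogue only: Hidet's task mappings live inside its tensor-program
# IR; this sketch just shows a mapping from workers to ordered lists of tasks.
WORKERS = 4    # e.g., threads or thread blocks
GRID = (4, 4)  # 4x4 grid of tile-level tasks for some kernel

def task_grid(grid):
    """Enumerate tile-level tasks of a 2D kernel in row-major order."""
    return [(i, j) for i in range(grid[0]) for j in range(grid[1])]

def row_major_mapping(tasks, workers):
    """Task mapping: assign task t to worker t % workers; each worker keeps row-major order."""
    assignment = {w: [] for w in range(workers)}
    for t, tile in enumerate(tasks):
        assignment[t % workers].append(tile)
    return assignment

mapping = row_major_mapping(task_grid(GRID), WORKERS)
for worker, tiles in mapping.items():
    print(f"worker {worker} executes tiles {tiles}")
```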
- LoopStack: a Lightweight Tensor Algebra Compiler Stack [61.04098601022665]
LoopStack is a domain-specific compiler stack for tensor operations.
It generates machine code that matches and frequently exceeds the performance of state-of-the-art machine learning frameworks.
It has a very small memory footprint: a binary size of 245KB and under 30K lines of effective code make it ideal for use on mobile and embedded devices.
arXiv Detail & Related papers (2022-05-02T01:57:58Z)
- Learning to Make Compiler Optimizations More Effective [11.125012960514471]
LoopLearner predicts which way of writing a loop will lead to efficient compiled code.
We evaluate LoopLearner with 1,895 loops from various performance-relevant benchmarks.
arXiv Detail & Related papers (2021-02-24T10:42:56Z)
- Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations [15.659251804042748]
Woodpecker-DL (WPK) is a hardware-aware deep learning framework.
WPK uses graph optimization, automated searches, a domain-specific language (DSL) and system-level exploration to accelerate inference.
We show that on a P100 GPU, we can achieve speedups of 5.40x over cuDNN and 1.63x over TVM on individual operators, and run up to 1.18x faster than TensorRT for end-to-end model inference.
arXiv Detail & Related papers (2020-08-11T07:50:34Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)