LoopStack: a Lightweight Tensor Algebra Compiler Stack
- URL: http://arxiv.org/abs/2205.00618v1
- Date: Mon, 2 May 2022 01:57:58 GMT
- Title: LoopStack: a Lightweight Tensor Algebra Compiler Stack
- Authors: Bram Wasti, José Pablo Cambronero, Benoit Steiner, Hugh Leather and Aleksandar Zlateski
- Abstract summary: LoopStack is a domain specific compiler stack for tensor operations.
It generates machine code that matches and frequently exceeds the performance of state-of-the-art machine learning frameworks.
It has a very small memory footprint: a 245KB binary and under 30K lines of effective code, which makes it ideal for use on mobile and embedded devices.
- Score: 61.04098601022665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present LoopStack, a domain specific compiler stack for tensor operations,
composed of a frontend, LoopTool, and an efficient optimizing code generator,
LoopNest. This stack enables us to compile entire neural networks and generate
code targeting the AVX2, AVX512, NEON, and NEONfp16 instruction sets while
incorporating optimizations often missing from other machine learning compiler
backends. We evaluate our stack on a collection of full neural networks and
commonly used network blocks as well as individual operators, and show that
LoopStack generates machine code that matches and frequently exceeds the
performance of state-of-the-art machine learning frameworks in both cases.
We also show that for a large collection of schedules LoopNest's compilation is
orders of magnitude faster than LLVM, while resulting in equal or improved run
time performance. Additionally, LoopStack has a very small memory footprint: a
binary size of 245KB and under 30K lines of effective code, which makes it ideal
for use on mobile and embedded devices.
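For readers unfamiliar with the workload class, here is a minimal generic sketch in Python/NumPy (not LoopStack's actual API) of a tensor contraction written as an explicit loop nest, together with one schedule transformation (loop tiling) of the kind a code generator such as LoopNest chooses among when lowering to AVX2/AVX512/NEON; the function names and tile size are illustrative only.

```python
# Generic illustration of the workload class a tensor-algebra compiler targets:
# a matrix-multiply contraction as a raw loop nest, plus one scheduling choice
# (splitting/tiling the n loop). This is NOT LoopStack/LoopTool/LoopNest code.
import numpy as np

def matmul_naive(A, B):
    """C[m, n] = sum_k A[m, k] * B[k, n], written as the plain loop nest."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            for k in range(K):
                C[m, n] += A[m, k] * B[k, n]
    return C

def matmul_tiled(A, B, tile=8):
    """Same contraction with the n loop tiled -- one point in the space of
    schedules (loop splitting, reordering, unrolling, vectorization) that a
    backend explores; the tile size 8 is an arbitrary illustrative choice."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for n0 in range(0, N, tile):                          # outer tile loop
        for m in range(M):
            for k in range(K):
                for n in range(n0, min(n0 + tile, N)):    # inner, vectorizable loop
                    C[m, n] += A[m, k] * B[k, n]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((16, 12)), rng.standard_normal((12, 20))
    assert np.allclose(matmul_naive(A, B), A @ B)
    assert np.allclose(matmul_tiled(A, B), A @ B)
```

Both variants compute the same result; a backend like LoopNest differs in that it selects such transformations automatically and emits vectorized machine code rather than Python.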
Related papers
- No Saved Kaleidosope: an 100% Jitted Neural Network Coding Language with Pythonic Syntax [0.8408735228878615]
We developed a jitted compiler for training Artificial Neural Networks using C++, LLVM and CUDA.
It features object-oriented characteristics, strong typing, parallel workers for data pre-processing, pythonic syntax for expressions, PyTorch like model declaration and Automatic Differentiation.
arXiv Detail & Related papers (2024-09-17T23:15:39Z) - Register Your Forests: Decision Tree Ensemble Optimization by Explicit CPU Register Allocation [3.737361598712633]
We present a code generation approach for decision tree ensembles, which produces machine assembly code within a single conversion step.
The results show that the performance of decision tree ensemble inference can be significantly improved.
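As a rough, hypothetical illustration of the general tree-compilation idea (not the paper's register-allocation scheme, and emitting Python source rather than machine assembly), a fixed decision tree can be turned into straight-line conditional code ahead of time instead of being interpreted node by node; the tree, feature indices, and thresholds below are invented.

```python
# Toy sketch: "compile" a fixed decision tree into code once, then call the
# generated function at inference time. The real paper emits machine assembly
# with explicit register allocation; this version only emits Python source.
# The tree structure, features, and thresholds are invented for illustration.

TREE = {  # internal node: (feature index, threshold, left subtree, right subtree)
    "node": (0, 0.5,
             {"leaf": 0},
             {"node": (1, 1.5, {"leaf": 1}, {"leaf": 2})}),
}

def emit(node, indent="    "):
    """Recursively emit an if/else chain for the tree."""
    if "leaf" in node:
        return f"{indent}return {node['leaf']}\n"
    feat, thr, left, right = node["node"]
    src = f"{indent}if x[{feat}] <= {thr}:\n"
    src += emit(left, indent + "    ")
    src += f"{indent}else:\n"
    src += emit(right, indent + "    ")
    return src

source = "def predict(x):\n" + emit(TREE)
namespace = {}
exec(source, namespace)                 # one-time "compilation" step
predict = namespace["predict"]

print(predict([0.2, 0.0]))              # -> 0
print(predict([0.9, 2.0]))              # -> 2
```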
arXiv Detail & Related papers (2024-04-10T09:17:22Z) - LoopTune: Optimizing Tensor Computations with Reinforcement Learning [43.82827359317833]
LoopTune is a compiler that optimizes tensor computations in deep learning models for the CPU.
With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM.
arXiv Detail & Related papers (2023-09-04T21:30:15Z) - PowerFusion: A Tensor Compiler with Explicit Data Movement Description
and Instruction-level Graph IR [10.059491353103526]
We propose IntelliGen, a tensor compiler that can generate high-performance code for memory-intensive operators.
IntelliGen considers both computation and data movement optimizations.
We evaluate IntelliGen on NVIDIA GPU, AMD GPU, and Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively.
arXiv Detail & Related papers (2023-07-11T03:17:40Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
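A minimal sketch of that two-step decomposition, with np.matmul standing in for a hand-optimized TPP micro-kernel; the block sizes and function names are assumptions for illustration, not taken from the paper.

```python
# Step 1: the computational core is packaged as a primitive (here a small GEMM
# on blocks, with NumPy standing in for a hand-optimized TPP implementation).
# Step 2: the logical loops around the primitive stay simple and declarative;
# all floating-point arithmetic lives inside the primitive. Block sizes are
# illustrative assumptions.
import numpy as np

def small_gemm(Cb, Ab, Bb):
    """The 'TPP': an opaque micro-kernel operating on small blocks."""
    Cb += Ab @ Bb

def blocked_matmul(A, B, bm=4, bn=8, bk=4):
    """Logical loops over blocks; only the primitive touches the data."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for m in range(0, M, bm):
        for n in range(0, N, bn):
            for k in range(0, K, bk):
                small_gemm(C[m:m + bm, n:n + bn],
                           A[m:m + bm, k:k + bk],
                           B[k:k + bk, n:n + bn])
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((8, 12)), rng.standard_normal((12, 16))
assert np.allclose(blocked_matmul(A, B), A @ B)
```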
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - HDCC: A Hyperdimensional Computing compiler for classification on
embedded systems and high-performance computing [58.720142291102135]
This work introduces HDCC, the first open-source compiler that translates high-level descriptions of HDC classification methods into optimized C code.
HDCC is designed like a modern compiler, featuring an intuitive and descriptive input language, an intermediate representation (IR), and a retargetable backend.
To substantiate these claims, we conducted experiments with HDCC on several of the most popular datasets in the HDC literature.
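For context, below is a minimal Python/NumPy sketch of the kind of HDC classification such a compiler lowers to C: random bipolar hypervectors are bound per feature, bundled into class prototypes, and queried by similarity. The dimensionality, encoding scheme, and toy data are assumptions for illustration and are not HDCC's input language.

```python
# Minimal hyperdimensional-computing (HDC) classifier sketch. All parameters
# (dimensionality, quantization levels, toy training data) are illustrative.
import numpy as np

D = 4096                                   # hypervector dimensionality
rng = np.random.default_rng(0)

def random_hv():
    return rng.choice([-1, 1], size=D)     # bipolar hypervector

def encode(sample, feature_hvs, level_hvs):
    """Bind each feature's identity vector with its quantized value, then bundle."""
    bound = [feature_hvs[i] * level_hvs[int(v)] for i, v in enumerate(sample)]
    return np.sign(np.sum(bound, axis=0))

# Toy setup: 3 features quantized to levels 0..4, two classes.
feature_hvs = [random_hv() for _ in range(3)]
level_hvs = [random_hv() for _ in range(5)]
train = {0: [[0, 1, 0], [1, 0, 1]], 1: [[4, 3, 4], [3, 4, 3]]}

# Class prototypes: bundle (sum and binarize) the encodings of each class's samples.
prototypes = {c: np.sign(sum(encode(s, feature_hvs, level_hvs) for s in xs))
              for c, xs in train.items()}

def classify(sample):
    hv = encode(sample, feature_hvs, level_hvs)
    return max(prototypes, key=lambda c: np.dot(prototypes[c], hv))  # most similar prototype

print(classify([0, 0, 1]))   # -> 0 with high probability at this dimensionality
print(classify([4, 4, 3]))   # -> 1 with high probability
```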
arXiv Detail & Related papers (2023-04-24T19:16:03Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712× speedup with 1301.25× energy reduction on CPU, and a 35.4× speedup with 17.66× energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - Learning to Make Compiler Optimizations More Effective [11.125012960514471]
LoopLearner predicts which way of writing a loop will lead to efficient compiled code.
We evaluate LoopLearner with 1,895 loops from various performance-relevant benchmarks.
arXiv Detail & Related papers (2021-02-24T10:42:56Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL
primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)