Tackling the Matrix Multiplication Micro-kernel Generation with Exo
- URL: http://arxiv.org/abs/2310.17408v2
- Date: Fri, 27 Oct 2023 08:28:03 GMT
- Title: Tackling the Matrix Multiplication Micro-kernel Generation with Exo
- Authors: Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi,
Héctor Martínez
- Abstract summary: We present a step-by-step procedure for generating a dedicated micro-kernel for each new hardware target.
Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
- Score: 0.5517652814152908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The optimization of matrix multiplication (or GEMM) has been a necessity
over the last decades. This operation is considered the flagship of current
linear algebra libraries such as BLIS, OpenBLAS, or Intel oneAPI because of its
widespread use in a large variety of scientific applications. GEMM is
usually implemented following the GotoBLAS philosophy, which tiles the GEMM
operands and uses a series of nested loops to improve performance. These
approaches extract the maximum computational power of the architecture through
small pieces of hardware-oriented, high-performance code called micro-kernels.
However, this approach forces developers to generate, with non-negligible
effort, a dedicated micro-kernel for each new hardware target.
In this work, we present a step-by-step procedure for generating
micro-kernels with the Exo compiler that perform close to (or even better
than) manually developed micro-kernels written with intrinsic functions or
assembly language. Our solution also improves the portability of the generated
code, since a hardware target is fully specified by a concise library-based
description of its instructions.
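To make the loop structure concrete, the pure-Python fragment below sketches the GotoBLAS-style blocked GEMM the abstract refers to, with the innermost MR x NR block playing the role of the micro-kernel that BLIS/OpenBLAS hand-write in intrinsics or assembly and that the paper generates with Exo. It is an illustration only, not the paper's Exo code: the tile sizes MC, NC, KC, MR, NR, the helper names, and the omission of operand packing are simplifying assumptions.

```python
# Illustrative sketch of a GotoBLAS-style blocked GEMM (not the paper's Exo code).
MC, NC, KC = 32, 32, 32   # cache-level tile sizes (hypothetical values)
MR, NR = 4, 4             # register-level micro-kernel footprint (hypothetical values)

def micro_kernel(C, A, B, i0, j0, k0, mr, nr, kc):
    # The hardware-specific piece: an mr x nr block of C is accumulated with a
    # rank-kc update. On real hardware this block stays in vector registers.
    for i in range(mr):
        for j in range(nr):
            acc = C[i0 + i][j0 + j]
            for k in range(kc):
                acc += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
            C[i0 + i][j0 + j] = acc

def gemm_blocked(M, N, K, A, B, C):
    # Outer loops of the GotoBLAS algorithm; packing of A/B panels is omitted.
    for jc in range(0, N, NC):                      # loop over column panels of C/B
        for pc in range(0, K, KC):                  # loop over the K dimension
            kc = min(KC, K - pc)
            for ic in range(0, M, MC):              # loop over row panels of C/A
                for jr in range(jc, min(jc + NC, N), NR):      # micro-tile columns
                    for ir in range(ic, min(ic + MC, M), MR):  # micro-tile rows
                        micro_kernel(C, A, B, ir, jr, pc,
                                     min(MR, M - ir), min(NR, N - jr), kc)

# Quick check against a naive triple loop on small random matrices.
import random
M = N = K = 37
A = [[random.random() for _ in range(K)] for _ in range(M)]
B = [[random.random() for _ in range(N)] for _ in range(K)]
C = [[0.0] * N for _ in range(M)]
gemm_blocked(M, N, K, A, B, C)
ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)] for i in range(M)]
assert all(abs(C[i][j] - ref[i][j]) < 1e-9 for i in range(M) for j in range(N))
```

In the paper's setting, the body of micro_kernel is the part that is rewritten per target; Exo lets that rewriting be driven by a library-based description of the target's instructions rather than by hand-written intrinsics.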
Related papers
- Towards a high-performance AI compiler with upstream MLIR [34.89141656581549]
This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance.
We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch.
arXiv Detail & Related papers (2024-04-15T10:35:50Z)
- Hybrid programming-model strategies for GPU offloading of electronic structure calculation kernels [2.4898174182192974]
PROGRESS is a library for electronic structure solvers.
It implements linear algebra operations for electronic structure kernels.
We describe the general strategies used for these implementations on various computer architectures.
arXiv Detail & Related papers (2024-01-24T19:38:01Z)
- Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM [0.20971479389679337]
We generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS.
We also leverage the Apache TVM framework to derive a complete variety of processor-specific micro-kernels for GEMM.
arXiv Detail & Related papers (2023-10-31T10:36:26Z)
- Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers [66.823588073584]
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performance in various applications.
Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs.
We propose a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs.
arXiv Detail & Related papers (2023-10-02T02:01:16Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Enabling Retargetable Optimizing Compilers for Quantum Accelerators via a Multi-Level Intermediate Representation [78.8942067357231]
We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler.
We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax.
Our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparative standalone quantum language compilers.
arXiv Detail & Related papers (2021-09-01T17:29:47Z)
- A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation [0.0]
This paper presents a methodology for using LLVM-based tools to tune the DCA++ application that targets the new ARM A64FX processor.
By applying these code changes, code speed was increased by 1.98x and 78 GFlops were achieved on the A64FX processor.
arXiv Detail & Related papers (2021-06-27T22:38:16Z)
- SMASH: Sparse Matrix Atomic Scratchpad Hashing [0.0]
In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel can achieve a 9.4x speedup compared to competing approaches (a sequential sketch of the row-wise formulation appears after this list).
arXiv Detail & Related papers (2021-05-29T00:22:50Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
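As referenced in the SMASH entry above, the following is a minimal sequential sketch of the row-wise product formulation of SpGEMM. It is illustrative only: a plain Python dictionary stands in for the atomic scratchpad hash table that SMASH uses to merge partial products as they are generated, and the dict-of-dicts sparse format is an assumption made for brevity.

```python
def spgemm_rowwise(A, B):
    """A and B map row index -> {column index: value}; returns C = A * B in the same format."""
    C = {}
    for i, a_row in A.items():
        acc = {}                           # per-row accumulator ("scratchpad")
        for k, a_ik in a_row.items():      # each nonzero A[i, k] ...
            for j, b_kj in B.get(k, {}).items():
                # ... scales the nonzeros of row k of B; partial products for the
                # same output column are merged as soon as they are produced
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

# Small check: A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]]  =>  A * B = [[0, 4], [15, 8]]
A = {0: {0: 1.0}, 1: {0: 2.0, 1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}}
print(spgemm_rowwise(A, B))   # {0: {1: 4.0}, 1: {0: 15.0, 1: 8.0}}
```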