Tackling the Matrix Multiplication Micro-kernel Generation with Exo
- URL: http://arxiv.org/abs/2310.17408v2
- Date: Fri, 27 Oct 2023 08:28:03 GMT
- Title: Tackling the Matrix Multiplication Micro-kernel Generation with Exo
- Authors: Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi,
Héctor Martínez
- Abstract summary: We present a step-by-step procedure for generating a dedicated micro-kernel for each new hardware target.
Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
- Score: 0.5517652814152908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The optimization of matrix multiplication (or GEMM) has been a necessity
over the last decades. This operation is considered the flagship of current
linear algebra libraries such as BLIS, OpenBLAS, or Intel oneAPI because of its
widespread use in a large variety of scientific applications. GEMM is
usually implemented following the GotoBLAS philosophy, which tiles the GEMM
operands and uses a series of nested loops to improve performance. These
approaches extract the maximum computational power of the architecture through
small pieces of hardware-oriented, high-performance code called micro-kernels.
However, this approach forces developers to generate, with non-negligible
effort, a dedicated micro-kernel for each new hardware target.
In this work, we present a step-by-step procedure for generating
micro-kernels with the Exo compiler that perform close to (or even better
than) manually developed micro-kernels written with intrinsic functions or
assembly language. Our solution also improves the portability of the generated
code, since a hardware target is fully specified by a concise library-based
description of its instructions.
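To make the loop structure concrete, the pure-Python fragment below sketches the GotoBLAS-style blocked GEMM the abstract refers to, with the innermost MR x NR block playing the role of the micro-kernel that BLIS/OpenBLAS hand-write in intrinsics or assembly and that the paper generates with Exo. It is an illustration only, not the paper's Exo code: the tile sizes MC, NC, KC, MR, NR, the helper names, and the omission of operand packing are simplifying assumptions.

```python
# Illustrative sketch of a GotoBLAS-style blocked GEMM (not the paper's Exo code).
MC, NC, KC = 32, 32, 32   # cache-level tile sizes (hypothetical values)
MR, NR = 4, 4             # register-level micro-kernel footprint (hypothetical values)

def micro_kernel(C, A, B, i0, j0, k0, mr, nr, kc):
    # The hardware-specific piece: an mr x nr block of C is accumulated with a
    # rank-kc update. On real hardware this block stays in vector registers.
    for i in range(mr):
        for j in range(nr):
            acc = C[i0 + i][j0 + j]
            for k in range(kc):
                acc += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
            C[i0 + i][j0 + j] = acc

def gemm_blocked(M, N, K, A, B, C):
    # Outer loops of the GotoBLAS algorithm; packing of A/B panels is omitted.
    for jc in range(0, N, NC):                      # loop over column panels of C/B
        for pc in range(0, K, KC):                  # loop over the K dimension
            kc = min(KC, K - pc)
            for ic in range(0, M, MC):              # loop over row panels of C/A
                for jr in range(jc, min(jc + NC, N), NR):      # micro-tile columns
                    for ir in range(ic, min(ic + MC, M), MR):  # micro-tile rows
                        micro_kernel(C, A, B, ir, jr, pc,
                                     min(MR, M - ir), min(NR, N - jr), kc)

# Quick check against a naive triple loop on small random matrices.
import random
M = N = K = 37
A = [[random.random() for _ in range(K)] for _ in range(M)]
B = [[random.random() for _ in range(N)] for _ in range(K)]
C = [[0.0] * N for _ in range(M)]
gemm_blocked(M, N, K, A, B, C)
ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)] for i in range(M)]
assert all(abs(C[i][j] - ref[i][j]) < 1e-9 for i in range(M) for j in range(N))
```

In the paper's setting, the body of micro_kernel is the part that is rewritten per target; Exo lets that rewriting be driven by a library-based description of the target's instructions rather than by hand-written intrinsics.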
Related papers
- Towards a high-performance AI compiler with upstream MLIR [34.89141656581549]
This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance.
We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch.
arXiv Detail & Related papers (2024-04-15T10:35:50Z)
- Hybrid programming-model strategies for GPU offloading of electronic structure calculation kernels [2.4898174182192974]
PROGRESS is a library for electronic structure solvers.
It implements linear algebra operations for electronic structure kernels.
We describe the general strategies used for these implementations on various computer architectures.
arXiv Detail & Related papers (2024-01-24T19:38:01Z)
- Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM [0.20971479389679337]
We generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS.
We also leverage the Apache TVM framework to derive a complete variety of processor-specific micro-kernels for GEMM.
arXiv Detail & Related papers (2023-10-31T10:36:26Z)
- Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers [66.823588073584]
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performance in various applications.
Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs.
We propose a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs.
arXiv Detail & Related papers (2023-10-02T02:01:16Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Enabling Retargetable Optimizing Compilers for Quantum Accelerators via a Multi-Level Intermediate Representation [78.8942067357231]
We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler.
We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax.
Our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparative standalone quantum language compilers.
arXiv Detail & Related papers (2021-09-01T17:29:47Z)
- A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation [0.0]
This paper presents a methodology for using LLVM-based tools to tune the DCA++ application that targets the new ARM A64FX processor.
By applying these code changes, code speed was increased by 1.98x and 78 GFlops were achieved on the A64FX processor.
arXiv Detail & Related papers (2021-06-27T22:38:16Z)
- SMASH: Sparse Matrix Atomic Scratchpad Hashing [0.0]
In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel can achieve a 9.4x speedup compared to competing approaches (a sequential sketch of the row-wise formulation appears after this list).
arXiv Detail & Related papers (2021-05-29T00:22:50Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
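As referenced in the SMASH entry above, the following is a minimal sequential sketch of the row-wise product formulation of SpGEMM. It is illustrative only: a plain Python dictionary stands in for the atomic scratchpad hash table that SMASH uses to merge partial products as they are generated, and the dict-of-dicts sparse format is an assumption made for brevity.

```python
def spgemm_rowwise(A, B):
    """A and B map row index -> {column index: value}; returns C = A * B in the same format."""
    C = {}
    for i, a_row in A.items():
        acc = {}                           # per-row accumulator ("scratchpad")
        for k, a_ik in a_row.items():      # each nonzero A[i, k] ...
            for j, b_kj in B.get(k, {}).items():
                # ... scales the nonzeros of row k of B; partial products for the
                # same output column are merged as soon as they are produced
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

# Small check: A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]]  =>  A * B = [[0, 4], [15, 8]]
A = {0: {0: 1.0}, 1: {0: 2.0, 1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}}
print(spgemm_rowwise(A, B))   # {0: {1: 4.0}, 1: {0: 15.0, 1: 8.0}}
```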