Automatic Generators for a Family of Matrix Multiplication Routines with
Apache TVM
- URL: http://arxiv.org/abs/2310.20347v1
- Date: Tue, 31 Oct 2023 10:36:26 GMT
- Title: Automatic Generators for a Family of Matrix Multiplication Routines with
Apache TVM
- Authors: Guillermo Alaejos, Adrián Castelló, Pedro Alonso-Jordá, Francisco D. Igual, Héctor Martínez, Enrique S. Quintana-Ortí
- Abstract summary: We generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS.
We also leverage the Apache TVM framework to derive a complete variety of processor-specific micro-kernels for GEMM.
- Score: 0.20971479389679337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the use of the Apache TVM open source framework to
automatically generate a family of algorithms that follow the approach taken by
popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in
order to obtain high-performance blocked formulations of the general matrix
multiplication (GEMM). In addition, we fully automate the generation process by
also leveraging Apache TVM to derive a complete variety of processor-specific
micro-kernels for GEMM. This is in contrast with the convention in
high-performance libraries, which hand-encode a single micro-kernel per
architecture in assembly code. Overall, the combination of our TVM-generated
blocked algorithms and micro-kernels for GEMM 1) improves portability and
maintainability and streamlines the software life cycle; 2) provides the
flexibility to easily tailor and optimize the solution to different data types,
processor architectures, and matrix operand shapes, yielding performance on a
par with (or, for specific matrix shapes, even superior to) that of hand-tuned
libraries; and 3) features a small memory footprint.
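As a concrete illustration of the approach the abstract describes, the following is a minimal sketch of expressing and blocking GEMM with TVM's tensor-expression (te) scheduling API, as available in TVM releases contemporary with the paper. The tile factors (4, 8, 64) and the llvm target are illustrative placeholders, not the cache- and register-aware parameters the paper's generators derive; the vectorized and unrolled inner loops merely stand in for a processor-specific micro-kernel.

```python
# Sketch: a blocked GEMM expressed and scheduled with TVM's te API.
import tvm
from tvm import te

M = N = K = 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = s[C].op.axis
(kk,) = s[C].op.reduce_axis
io, ii = s[C].split(i, factor=4)    # mr: rows of the register tile
jo, ji = s[C].split(j, factor=8)    # nr: columns of the register tile
ko, ki = s[C].split(kk, factor=64)  # kc-style blocking of the reduction
s[C].reorder(io, jo, ko, ki, ii, ji)
s[C].vectorize(ji)                  # stand-in for the SIMD micro-kernel
s[C].unroll(ii)

func = tvm.build(s, [A, B, C], target="llvm")
# print(tvm.lower(s, [A, B, C], simple_mode=True))  # inspect the loop nest
```

Reordering the split loops reproduces the layered loop nest of GotoBLAS2/BLIS-style blocked GEMM, with the two innermost loops playing the role of the micro-kernel.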
Related papers
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels [12.77187564450236]
We introduce XY-Serve, a versatile, Ascend-native, end-to-end production large language model (LLM) serving system.
The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into fine-grained meta primitives.
For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes.
arXiv Detail & Related papers (2024-12-24T02:27:44Z)
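A hedged sketch of the "virtual padding" idea described above: round dynamic GEMM shapes up to a supported fixed tile size so a pre-compiled fixed-tile kernel can be reused. Function names and tile sizes here are invented for illustration and are not XY-Serve's actual API.

```python
# Sketch: pad dynamic GEMM operands to the next fixed-tile multiple.
import numpy as np

def round_up(dim: int, tile: int) -> int:
    return ((dim + tile - 1) // tile) * tile

def padded_gemm(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    mp, kp, np_ = round_up(m, tile), round_up(k, tile), round_up(n, tile)
    a_pad = np.zeros((mp, kp), dtype=a.dtype)
    b_pad = np.zeros((kp, np_), dtype=b.dtype)
    a_pad[:m, :k] = a
    b_pad[:k, :n] = b
    # A real system would dispatch to a fixed-tile kernel here; NumPy's matmul
    # stands in for it, and the padded region contributes only zeros.
    return (a_pad @ b_pad)[:m, :n]
```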
- Hybrid programming-model strategies for GPU offloading of electronic structure calculation kernels [2.4898174182192974]
PROGRESS is a library for electronic structure solvers.
It implements linear algebra operations for electronic structure kernels.
We describe the general strategies used for these implementations on various computer architectures.
arXiv Detail & Related papers (2024-01-24T19:38:01Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach from information retrieval to LLM compression.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
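A toy sketch of the additive-quantization representation underlying AQLM: each weight group is stored as indices into several codebooks and reconstructed as the sum of the selected codewords. Sizes and the decode routine are illustrative assumptions, not AQLM's implementation.

```python
# Sketch: decoding additively quantized weight groups.
import numpy as np

rng = np.random.default_rng(0)
M, K, g = 2, 256, 8          # M codebooks, K codewords each, group size g
codebooks = rng.standard_normal((M, K, g)).astype(np.float32)
codes = rng.integers(0, K, size=(1000, M))  # M code indices per weight group

def decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # Reconstruct each group as the sum of its selected codewords.
    return sum(codebooks[m, codes[:, m]] for m in range(codebooks.shape[0]))

w_hat = decode(codes, codebooks)  # (1000, g) reconstructed weight groups
```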
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances using generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library [0.7366405857677227]
We provide an optimized implementation of the forward pass of FlashAttention-2 as a custom fused kernel targeting NVIDIA Hopper architecture.
We observe 20-50% higher FLOPs/s over a version of FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.
arXiv Detail & Related papers (2023-12-19T07:56:25Z)
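The fused kernel above builds on blockwise attention with an online softmax; the NumPy sketch below shows that core recurrence (running row maxima and normalizers over key/value tiles), not the CUTLASS/Hopper kernel itself.

```python
# Sketch: blockwise attention with an online softmax (FlashAttention-style).
import numpy as np

def flash_attention_forward(q, k, v, tile=64):
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)
    row_sum = np.zeros(n)
    for start in range(0, k.shape[0], tile):
        kt = k[start:start + tile]               # (t, d) tile of keys
        vt = v[start:start + tile]               # (t, d) tile of values
        s = (q @ kt.T) * scale                   # (n, t) partial scores
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)   # rescale earlier partials
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vt
        row_max = new_max
    return out / row_sum[:, None]

# Agrees with the naive reference up to floating-point error.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
s = (q @ k.T) / np.sqrt(64)
p = np.exp(s - s.max(axis=1, keepdims=True))
naive = (p / p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(flash_attention_forward(q, k, v), naive, atol=1e-6)
```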
- Tackling the Matrix Multiplication Micro-kernel Generation with Exo [0.5517652814152908]
We present a step-by-step procedure for generating a dedicated micro-kernel for each new hardware target.
Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
arXiv Detail & Related papers (2023-10-26T14:09:57Z)
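For reference, the micro-kernel that generators such as Exo (and the TVM-based approach of the main paper) specialize per architecture is, at its core, a small outer-product update of a register-resident tile. The NumPy sketch below is a functional model only; real micro-kernels map these updates onto SIMD registers and fused multiply-add instructions.

```python
# Sketch: the rank-1-update structure of a GEMM micro-kernel.
import numpy as np

MR, NR = 4, 8  # illustrative register-tile sizes

def micro_kernel(a_panel: np.ndarray, b_panel: np.ndarray, c_tile: np.ndarray):
    """a_panel: (MR, kc) packed panel of A; b_panel: (kc, NR) packed panel
    of B; c_tile: (MR, NR) output tile, updated in place."""
    for p in range(a_panel.shape[1]):
        # One rank-1 (outer-product) update per iteration of the kc loop.
        c_tile += np.outer(a_panel[:, p], b_panel[p, :])

rng = np.random.default_rng(0)
a = rng.standard_normal((MR, 32))
b = rng.standard_normal((32, NR))
c = np.zeros((MR, NR))
micro_kernel(a, b, c)
assert np.allclose(c, a @ b)
```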
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy with over an 80% reduction in computation, but leaves challenges for efficient deployment on FPGAs.
Our work, dubbed Edge-MoE, solves these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
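As background for the MoE-induced sparsity Edge-MoE exploits, here is a minimal top-k routing sketch: each token activates only k of E experts, so most expert weights stay idle per token. The softmax gate and shapes follow common MoE conventions and are assumptions, not Edge-MoE's exact design.

```python
# Sketch: top-k mixture-of-experts routing.
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """x: (tokens, d); gate_w: (d, E); experts: list of E (d, d) matrices."""
    logits = x @ gate_w                        # (tokens, E) routing scores
    topk = np.argsort(-logits, axis=1)[:, :k]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        w = np.exp(logits[t, sel] - logits[t, sel].max())
        w /= w.sum()                           # renormalized gate weights
        for weight, e in zip(w, sel):
            out[t] += weight * (x[t] @ experts[e])  # only k experts run
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
gate_w = rng.standard_normal((16, 8))
experts = [rng.standard_normal((16, 16)) for _ in range(8)]
y = moe_layer(x, gate_w, experts)
```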
- Reconfigurable Intelligent Surface Assisted Mobile Edge Computing with Heterogeneous Learning Tasks [53.1636151439562]
Mobile edge computing (MEC) provides a natural platform for AI applications.
We present an infrastructure to perform machine learning tasks at an MEC with the assistance of a reconfigurable intelligent surface (RIS).
Specifically, we minimize the learning error of all participating users by jointly optimizing the transmit power of the mobile users, the beamforming vectors of the base station, and the phase-shift matrix of the RIS.
arXiv Detail & Related papers (2020-12-25T07:08:50Z)
- Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM [0.0]
We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient code.
Our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.
arXiv Detail & Related papers (2020-07-26T04:50:51Z)
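A sketch of the block-sparse multiplication whose schedule space the work above explores with TVM, using the common BSR layout (indptr/indices over block rows, dense nonzero blocks). The block size and data are illustrative.

```python
# Sketch: block-sparse (BSR-style) matrix multiplication.
import numpy as np

BS = 4  # block size

def bsr_matmul(indptr, indices, blocks, x, n_block_rows):
    """blocks: (nnz_blocks, BS, BS); x: (n_block_cols * BS, m)."""
    out = np.zeros((n_block_rows * BS, x.shape[1]))
    for br in range(n_block_rows):
        for idx in range(indptr[br], indptr[br + 1]):
            bc = indices[idx]  # block-column of this nonzero block
            out[br * BS:(br + 1) * BS] += blocks[idx] @ x[bc * BS:(bc + 1) * BS]
    return out

# Two block rows; row 0 holds blocks at block-columns 0 and 2, row 1 at 1.
indptr = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
blocks = np.random.default_rng(0).standard_normal((3, BS, BS))
x = np.random.default_rng(1).standard_normal((3 * BS, 5))
y = bsr_matmul(indptr, indices, blocks, x, n_block_rows=2)
```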
- Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL [53.40030379661183]
Auto-PyTorch is a framework to enable fully automated deep learning (AutoDL).
It combines multi-fidelity optimization with portfolio construction for warmstarting and ensembling of deep neural networks (DNNs).
We show that Auto-PyTorch performs better than several state-of-the-art competitors on average.
arXiv Detail & Related papers (2020-06-24T15:15:17Z)
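Auto-PyTorch's multi-fidelity optimization builds on the Hyperband/BOHB family; the sketch below shows successive halving, the core budget-allocation idea: evaluate many configurations cheaply, keep the best fraction, and re-evaluate the survivors at a higher budget. The objective here is a toy stand-in.

```python
# Sketch: successive halving, the core of multi-fidelity optimization.
import numpy as np

def successive_halving(configs, evaluate, min_budget=1, eta=3, rounds=3):
    budget = min_budget
    survivors = list(configs)
    for _ in range(rounds):
        scores = [evaluate(c, budget) for c in survivors]
        keep = max(1, len(survivors) // eta)  # keep the top 1/eta fraction
        order = np.argsort(scores)            # lower score = better
        survivors = [survivors[i] for i in order[:keep]]
        budget *= eta                         # raise the fidelity
    return survivors[0]

# Toy objective: "loss" of a learning-rate config, noisier at low budget.
rng = np.random.default_rng(0)
def evaluate(lr, budget):
    return (np.log10(lr) + 2) ** 2 + rng.normal(scale=1.0 / budget)

best = successive_halving([10 ** -e for e in np.linspace(0.5, 4, 27)], evaluate)
```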
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.