Autotuning Apache TVM-based Scientific Applications Using Bayesian
Optimization
- URL: http://arxiv.org/abs/2309.07235v1
- Date: Wed, 13 Sep 2023 18:15:58 GMT
- Title: Autotuning Apache TVM-based Scientific Applications Using Bayesian
Optimization
- Authors: Xingfu Wu, Praveen Paramasivam, Valerie Taylor
- Abstract summary: We propose a new TVM autotuning framework using Bayesian Optimization and use the TVM tensor expression language to implement linear algebra kernels such as LU, Cholesky, and 3mm.
We compare the proposed autotuning framework with the TVM autotuning framework AutoTVM with four tuners and find that our framework outperforms AutoTVM in most cases.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Apache TVM (Tensor Virtual Machine), an open source machine learning compiler
framework designed to optimize computations across various hardware platforms,
provides an opportunity to improve the performance of dense matrix
factorizations such as LU (Lower Upper) decomposition and Cholesky
decomposition on GPUs and AI (Artificial Intelligence) accelerators. In this
paper, we propose a new TVM autotuning framework using Bayesian Optimization
and use the TVM tensor expression language to implement linear algebra kernels
such as LU, Cholesky, and 3mm. We use these scientific computation kernels to
evaluate the effectiveness of our methods on a GPU cluster, called Swing, at
Argonne National Laboratory. We compare the proposed autotuning framework with
the TVM autotuning framework AutoTVM with four tuners and find that our
framework outperforms AutoTVM in most cases.
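The approach described in the abstract lends itself to a small illustration. Below is a minimal sketch, not the authors' framework: it assumes Apache TVM's tensor expression (te) API and scikit-optimize's gp_minimize, defines a plain dense matrix multiplication standing in for the LU, Cholesky, and 3mm kernels, exposes two tile sizes as the tuning space, and lets a Gaussian-process Bayesian optimizer minimize measured runtime. The helper names build_matmul and measure_config are hypothetical and introduced only for this sketch.

# Minimal, illustrative sketch (not the authors' implementation): express a
# kernel in TVM's tensor expression (te) language, expose schedule parameters
# (tile sizes), and let a Bayesian optimizer search over them.
# Assumes Apache TVM and scikit-optimize are installed.
import numpy as np
import tvm
from tvm import te
from skopt import gp_minimize

N = 512  # square matrix size for the toy matmul below

def build_matmul(tile_x, tile_y):
    """Define C = A @ B in the te language and tile the two spatial loops."""
    A = te.placeholder((N, N), name="A")
    B = te.placeholder((N, N), name="B")
    k = te.reduce_axis((0, N), name="k")
    C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)
    xo, xi = s[C].split(C.op.axis[0], factor=tile_x)
    yo, yi = s[C].split(C.op.axis[1], factor=tile_y)
    s[C].reorder(xo, yo, xi, yi)
    return tvm.build(s, [A, B, C], target="llvm")

def measure_config(params):
    """Objective for the Bayesian optimizer: mean runtime of one configuration."""
    tile_x, tile_y = int(params[0]), int(params[1])
    func = build_matmul(tile_x, tile_y)
    dev = tvm.cpu(0)
    a = tvm.nd.array(np.random.rand(N, N).astype("float32"), dev)
    b = tvm.nd.array(np.random.rand(N, N).astype("float32"), dev)
    c = tvm.nd.array(np.zeros((N, N), dtype="float32"), dev)
    timer = func.time_evaluator(func.entry_name, dev, number=3)
    return timer(a, b, c).mean  # seconds; lower is better

# Gaussian-process-based Bayesian optimization over the tile-size space.
result = gp_minimize(measure_config, dimensions=[(4, 64), (4, 64)], n_calls=20)
print("best tile sizes:", result.x, "best time (s):", result.fun)

In the paper's setting the kernels are LU, Cholesky, and 3mm, the targets are GPUs and AI accelerators, and the schedule space is richer (thread binding, vectorization, loop ordering), but the measure-and-propose loop is the same in spirit.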
Related papers
- CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization [5.909352339240516]
We present CATBench, a comprehensive benchmarking suite that captures the complexities of compiler autotuning.
The benchmarks in CATBench span a range of machine learning-oriented computations, from tensor algebra to image processing and clustering.
We validate CATBench on several state-of-the-art algorithms, revealing their strengths and weaknesses.
arXiv Detail & Related papers (2024-06-24T20:15:04Z) - Automatic Generators for a Family of Matrix Multiplication Routines with
Apache TVM [0.20971479389679337]
We generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS.
We also leverage the Apache TVM framework to derive a complete variety of processor-specific micro-kernels for GEMM.
arXiv Detail & Related papers (2023-10-31T10:36:26Z) - Mechanic: A Learning Rate Tuner [52.4242550204696]
We introduce a technique for automatically tuning the learning rate scale factor of any base optimization algorithm and schedule, which we call Mechanic.
We rigorously evaluate Mechanic on a range of large-scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms.
arXiv Detail & Related papers (2023-05-31T19:32:43Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - Neighbor Correspondence Matching for Flow-based Video Frame Synthesis [90.14161060260012]
We introduce a neighbor correspondence matching (NCM) algorithm for flow-based frame synthesis.
NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel.
The coarse-scale module is designed to leverage neighbor correspondences to capture large motion, while the fine-scale module is more efficient and speeds up the estimation process.
arXiv Detail & Related papers (2022-07-14T09:17:00Z) - OMLT: Optimization & Machine Learning Toolkit [54.58348769621782]
The optimization and machine learning toolkit (OMLT) is an open-source software package incorporating neural network and gradient-boosted tree surrogate models.
We discuss the advances in optimization technology that made OMLT possible and show how OMLT seamlessly integrates with the algebraic modeling language Pyomo.
arXiv Detail & Related papers (2022-02-04T22:23:45Z) - Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization
Pragmas Using Bayesian Optimization (extended version) [0.8070511670572696]
We use LLVM Clang/Polly loop optimization pragmas to optimize PolyBench benchmarks.
We then use a Bayesian-optimization-based autotuning framework to tune the pragma parameters and improve performance.
We also present loop autotuning that requires no user knowledge of the code, using a simple mctree autotuning framework to further improve the performance of the Floyd-Warshall benchmark.
arXiv Detail & Related papers (2021-04-27T14:46:57Z) - Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for
Low-Latency Inference in NLP Applications [23.37992621844846]
Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications.
We propose a novel vector-vector-matrix architecture (VVMA) which greatly reduces inference latency for neural machine translation (NMT).
We present empirical results suggesting that our framework can reduce the latency of sequence-to-sequence and Transformer models used for NMT by a factor of four.
arXiv Detail & Related papers (2020-10-06T16:54:08Z) - Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM [0.0]
We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient code.
Our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks (a minimal sketch of the cross-thread reduction pattern appears after this list).
arXiv Detail & Related papers (2020-07-26T04:50:51Z) - Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and
Robust AutoDL [53.40030379661183]
Auto-PyTorch is a framework to enable fully automated deep learning (AutoDL).
It combines multi-fidelity optimization with portfolio construction for warmstarting and ensembling of deep neural networks (DNNs).
We show that Auto-PyTorch performs better than several state-of-the-art competitors on average.
arXiv Detail & Related papers (2020-06-24T15:15:17Z)
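The block-sparse GEMM entry above (Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM) mentions a cross-thread reduction implemented through TVM's schedule primitives. As a point of reference only, here is a minimal sketch of that scheduling pattern on a plain row-sum, assuming TVM's te API; it is not the paper's block-sparse kernel. The reduction axis is split and rfactor-ed, the inner reduction is bound to threadIdx.x so the threads of a block cooperate on one output element, and a store predicate ensures a single thread writes the result.

# Hedged sketch of TVM's cross-thread reduction pattern on a simple row-sum
# (illustrative only; the paper applies the idea to block-sparse matmul on CUDA).
import tvm
from tvm import te

n, m = 1024, 1024
A = te.placeholder((n, m), name="A")
k = te.reduce_axis((0, m), name="k")
B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B")

s = te.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=32)
BF = s.rfactor(B, ki)                          # partial sums, one per inner reduction index
xo, xi = s[B].split(s[B].op.axis[0], factor=32)
s[B].bind(xo, te.thread_axis("blockIdx.x"))
s[B].bind(xi, te.thread_axis("threadIdx.y"))
tx = te.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)          # threads along x cooperate on the reduction
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
s[B].set_store_predicate(tx.var.equal(0))      # only one thread writes the final result
fcuda = tvm.build(s, [A, B], target="cuda")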
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.