Related papers: Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

URL: http://arxiv.org/abs/2007.13055v1
Date: Sun, 26 Jul 2020 04:50:51 GMT
Title: Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM
Authors: Zijing Gu
Abstract summary: We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient code. Our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the automatic parameter tuning in TVM, our cross-thread reduction based implementation achieved competitive or better performance compared with other state-of-the-art frameworks.

Related papers

SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference for CPU-based architectures. Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
arXiv Detail & Related papers (2024-11-23T21:43:38Z)
Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance. We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU [0.0]
We focus on three matrix operations relevant for machine learning applications. We develop optimized implementations for SPMM, SDDMM, and FusedMM operations utilizing Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API.
arXiv Detail & Related papers (2023-11-01T08:43:59Z)
Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM [0.20971479389679337]
We generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS. We also leverage the Apache TVM framework to derive a complete variety of the processor-specific micro- Kernels for GEMM.
arXiv Detail & Related papers (2023-10-31T10:36:26Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Autotuning Apache TVM-based Scientific Applications Using Bayesian Optimization [0.0]
We propose a new TVM autotuning framework using Bayesian Optimization and use the TVM tensor expression language to implement linear algebra kernels such as LU, Cholesky, and 3mm. We compare the proposed autotuning framework with the TVM autotuning framework AutoTVM with four tuners and find that our framework outperforms AutoTVM in most cases.
arXiv Detail & Related papers (2023-09-13T18:15:58Z)
Optimized Sparse Matrix Operations for Reverse Mode Automatic Differentiation [3.72826300260966]
We present an implementation of a CSR-based sparse matrix wrapper for PyTorch with acceleration for basic matrix operations, as well as automatic differentiability. We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.
arXiv Detail & Related papers (2022-12-10T00:46:51Z)
Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix. Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm. On several simulations, our method scales better than existing approaches in computation time and memory.
arXiv Detail & Related papers (2022-02-25T16:35:26Z)
Fast Differentiable Matrix Square Root and Inverse Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root and the inverse square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad'e Approximants (MPA) A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration.
arXiv Detail & Related papers (2022-01-29T10:00:35Z)
Fast Differentiable Matrix Square Root [65.67315418971688]
We propose two more efficient variants to compute the differentiable matrix square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP) The other method is to use Matrix Pad'e Approximants (MPA)
arXiv Detail & Related papers (2022-01-21T12:18:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.