Make Inference Faster: Efficient GPU Memory Management for Butterfly Sparse Matrix Multiplication
- URL: http://arxiv.org/abs/2405.15013v1
- Date: Thu, 23 May 2024 19:36:10 GMT
- Title: Make Inference Faster: Efficient GPU Memory Management for Butterfly Sparse Matrix Multiplication
- Authors: Antoine Gonon, Léon Zheng, Pascal Carrivain, Quoc-Tung Le,
- Abstract summary: This paper is the first to assess the state of existing sparse matrix multiplication algorithms on GPU for the butterfly structure.
We show that these memory operations can be optimized by introducing a new kernel.
We also demonstrate the broader significance of our results by showing how the new kernel can speed up the inference of neural networks.
- Score: 4.387337528923525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is the first to assess the state of existing sparse matrix multiplication algorithms on GPU for the butterfly structure, a promising form of sparsity. This is achieved through a comprehensive benchmark that can be easily modified to add a new implementation. The goal is to provide a simple tool for users to select the optimal implementation based on their settings. Using this benchmark, we find that existing implementations spend up to 50% of their total runtime on memory rewriting operations. We show that these memory operations can be optimized by introducing a new CUDA kernel that minimizes the transfers between the different levels of GPU memory, achieving a median speed-up factor of x1.4 while also reducing energy consumption (median of x0.85). We also demonstrate the broader significance of our results by showing how the new kernel can speed up the inference of neural networks.
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
This paper proposes a paradigm on convolution-based convolutions called im2win, which only reduces memory footprint but also offers continuous memory accesses.
We compare our implementation with the direct convolution, and PyTorch's GEMM-based convolution, and six$$ DNN-based convolution implementations, with twelve state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-06-25T19:09:56Z) - HEAT: A Highly Efficient and Affordable Training System for
Collaborative Filtering Based Recommendation on CPUs [11.007606356081435]
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation.
There is no work that optimized SimpleX on multi-core CPUs, leading to limited performance.
We propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs.
arXiv Detail & Related papers (2023-04-14T18:07:26Z) - Pex: Memory-efficient Microcontroller Deep Learning through Partial
Execution [11.336229510791481]
We discuss a novel execution paradigm for microcontroller deep learning.
It modifies the execution of neural networks to avoid materialising full buffers in memory.
This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time.
arXiv Detail & Related papers (2022-11-30T18:47:30Z) - Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show significant improvement of 32X to 76x speedup with the new implementation using GPU over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z) - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through a gradient descent.
We achieve a combined speed of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z) - Rethinking Space-Time Networks with Improved Memory Coverage for
Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Sparse GPU Kernels for Deep Learning [24.94153856081836]
Deep learning applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernels to outperform their dense counterparts.
We develop high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks.
Using our kernels, we demonstrate sparse Transformer and MobileNet models that achieve 1.2-2.1x speedups and up to 12.8x memory savings without sacrificing accuracy.
arXiv Detail & Related papers (2020-06-18T23:59:11Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.