Efficient Quantized Sparse Matrix Operations on Tensor Cores
- URL: http://arxiv.org/abs/2209.06979v4
- Date: Sun, 7 May 2023 04:54:20 GMT
- Title: Efficient Quantized Sparse Matrix Operations on Tensor Cores
- Authors: Shigang Li, Kazuki Osawa, Torsten Hoefler
- Abstract summary: We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores.
We show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with comparable accuracy for end-to-end sparse Transformer inference.
- Score: 21.963041375857117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponentially growing model size drives the continued success of deep
learning, but it brings prohibitive computation and memory cost. From the
algorithm perspective, model sparsification and quantization have been studied
to alleviate the problem. From the architecture perspective, hardware vendors
provide Tensor cores for acceleration. However, it is very challenging to gain
practical speedups from sparse, low-precision matrix operations on Tensor
cores, because of the strict requirements for data layout and lack of support
for efficiently manipulating the low-precision integers. We propose Magicube, a
high-performance sparse-matrix library for low-precision integers on Tensor
cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep
learning with mixed precision. Experimental results on an NVIDIA A100 GPU show
that Magicube achieves on average 1.44x (up to 2.37x) speedup over the
vendor-optimized library for sparse kernels, and 1.43x speedup over the
state-of-the-art with a comparable accuracy for end-to-end sparse Transformer
inference.
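For reference, the two kernels named above compute the following: SpMM multiplies a sparse matrix by a dense one, and SDDMM evaluates a dense-dense product only at the nonzero positions of a sparse sampling mask. The NumPy sketch below spells out these semantics for an int8-quantized sparse operand in a simple CSR layout; it illustrates what Magicube computes, not Magicube's actual data layout, API, or Tensor-core kernels.

```python
import numpy as np

def spmm_int8(values, col_idx, row_ptr, B):
    """SpMM reference: C = A @ B, where A is sparse with int8 values (CSR)."""
    n_rows = len(row_ptr) - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=np.int32)  # int32 accumulation
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            C[i] += np.int32(values[p]) * B[col_idx[p]].astype(np.int32)
    return C

def sddmm_int8(A, B, col_idx, row_ptr):
    """SDDMM reference: out[p] = (A @ B)[i, j] only at the nonzeros (i, j) of the mask."""
    out = np.zeros(len(col_idx), dtype=np.int32)
    for i in range(len(row_ptr) - 1):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            out[p] = A[i].astype(np.int32) @ B[:, col_idx[p]].astype(np.int32)
    return out

# Toy mask: a 2x3 sparse matrix with nonzeros at (0,0), (0,2), (1,1).
values  = np.array([3, -2, 5], dtype=np.int8)
col_idx = np.array([0, 2, 1])
row_ptr = np.array([0, 2, 3])
B = np.random.randint(-8, 8, size=(3, 4), dtype=np.int8)
A_dense = np.random.randint(-8, 8, size=(2, 3), dtype=np.int8)
print(spmm_int8(values, col_idx, row_ptr, B))
print(sddmm_int8(A_dense, B, col_idx, row_ptr))
```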
Related papers
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs [20.4238781638402]
Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems.
Existing GPU libraries offer two dataflow types for sparse convolution.
We introduce TorchSparse++, a new GPU library that achieves the best of both worlds.
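The abstract does not name the two dataflows, so the following is only a hedged illustration: one dataflow commonly used for sparse convolution is gather-GEMM-scatter, where for each kernel offset the matching input features are gathered, multiplied by that offset's weight matrix, and scatter-added into the outputs. A minimal 1D NumPy sketch of that pattern (not TorchSparse++'s code):

```python
import numpy as np

def sparse_conv1d_gather_scatter(coords, feats, weights, offsets):
    """Gather-GEMM-scatter sparse convolution (1D, output sites = input sites).

    coords:  (N,) integer coordinates of active sites
    feats:   (N, C_in) features at those sites
    weights: (len(offsets), C_in, C_out) one weight matrix per kernel offset
    """
    index = {int(c): i for i, c in enumerate(coords)}      # coordinate hash map
    out = np.zeros((len(coords), weights.shape[2]), dtype=feats.dtype)
    for k, off in enumerate(offsets):
        # Gather: find (input, output) index pairs connected by this kernel offset.
        pairs = [(index[int(c) + off], i) for i, c in enumerate(coords)
                 if int(c) + off in index]
        if not pairs:
            continue
        src, dst = zip(*pairs)
        # GEMM on the gathered features, then scatter-add into the outputs.
        out[list(dst)] += feats[list(src)] @ weights[k]
    return out

coords  = np.array([0, 1, 4, 5])                        # active 1D sites
feats   = np.random.randn(4, 3).astype(np.float32)
weights = np.random.randn(3, 3, 2).astype(np.float32)   # kernel size 3
print(sparse_conv1d_gather_scatter(coords, feats, weights, offsets=[-1, 0, 1]))
```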
arXiv Detail & Related papers (2023-10-25T21:02:38Z)
- VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores [19.28753465771938]
We show that Spatha, a high-performance sparse library for Deep Learning routines, achieves up to 37x speedup over cuBLAS.
We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
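V:N:M generalizes the N:M sparsity pattern accepted by sparse Tensor Cores (e.g., 2:4, meaning at most 2 nonzeros in every group of 4 consecutive weights). The sketch below shows plain N:M magnitude pruning as the underlying building block; the vectorized V dimension and the paper's second-order pruning criterion are not reproduced here.

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m (N:M sparsity)."""
    w = weights.reshape(-1, m).copy()
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

W = np.random.randn(4, 8).astype(np.float32)
W_sparse = prune_n_m(W, n=2, m=4)                    # at most 2 nonzeros per group of 4
print((W_sparse.reshape(-1, 4) != 0).sum(axis=1))    # nonzero count per group
```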
arXiv Detail & Related papers (2023-10-03T14:08:26Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
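The abstract does not detail the table construction, so the following is an assumed, simplified variant of the idea: with 2-bit weights there are only four possible weight values, so the product of each activation with every codebook value can be precomputed, turning a dot product into table gathers plus a reduction. This sketch is not DeepGEMM's SIMD kernel.

```python
import numpy as np

# Codebook for 2-bit weights: each weight is an index into these four values.
CODEBOOK = np.array([-2, -1, 1, 2], dtype=np.int32)

def lut_dot(w_codes, activations):
    """Dot product with 2-bit weights via a lookup table.

    w_codes:     (K,) uint8 codes in {0, 1, 2, 3}
    activations: (K,) int8 activations
    """
    # Precompute table[k, c] = activations[k] * CODEBOOK[c]  (K x 4 entries).
    table = activations.astype(np.int32)[:, None] * CODEBOOK[None, :]
    # The dot product becomes one gather per weight plus a reduction.
    return table[np.arange(len(w_codes)), w_codes].sum()

w_codes = np.random.randint(0, 4, size=16).astype(np.uint8)
acts = np.random.randint(-128, 128, size=16, dtype=np.int8)
assert lut_dot(w_codes, acts) == CODEBOOK[w_codes] @ acts.astype(np.int32)
print(lut_dot(w_codes, acts))
```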
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time consuming because sparse graph-based operations are hard to accelerate on hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
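A classic instance of sampling-based approximation, given here only to make the idea concrete (RSC's actual sampling and caching scheme for message passing may differ): approximate a matrix product by sampling part of the inner dimension and rescaling so the estimate is unbiased.

```python
import numpy as np

def sampled_matmul(A, B, k, rng=np.random.default_rng(0)):
    """Approximate A @ B by sampling k of the inner-dimension terms.

    Each sampled outer product A[:, j] B[j, :] is rescaled by 1 / (k * p_j)
    so the estimator is unbiased.
    """
    d = A.shape[1]
    # Importance sampling proportional to column/row norms.
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(d, size=k, replace=True, p=p)
    scale = 1.0 / (k * p[idx])
    return (A[:, idx] * scale) @ B[idx, :]

A = np.random.randn(64, 512)
B = np.random.randn(512, 32)
approx, exact = sampled_matmul(A, B, k=128), A @ B
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # relative error
```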
arXiv Detail & Related papers (2022-10-19T17:25:33Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
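As background on what a QR-based ED does: repeatedly factoring A_k = Q_k R_k and forming A_{k+1} = R_k Q_k drives a symmetric matrix toward a diagonal of eigenvalues, and the accumulated Q factors give the eigenvectors. The sketch below is the textbook unshifted, unbatched iteration; the paper's batched, GPU-oriented variant is not reproduced here.

```python
import numpy as np

def qr_eigh(A, iters=500):
    """Eigendecomposition of a symmetric matrix via the basic QR iteration."""
    Ak = A.copy()
    V = np.eye(A.shape[0])
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                 # similarity transform, eigenvalues preserved
        V = V @ Q                  # accumulate eigenvectors
    return np.diag(Ak), V

M = np.random.randn(8, 8)
A = (M + M.T) / 2                  # symmetric test matrix
eigvals, eigvecs = qr_eigh(A)
print(np.sort(eigvals))
print(np.linalg.eigvalsh(A))       # reference eigenvalues for comparison
```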
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- TorchSparse: Efficient Point Cloud Inference Engine [24.541195361633523]
We introduce TorchSparse, a high-performance point cloud inference engine.
TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement.
It achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
arXiv Detail & Related papers (2022-04-21T17:58:30Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
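For reference, a 1D dilated convolution simply spaces the kernel taps by the dilation factor. A plain NumPy definition of the forward pass with 'valid' padding is below; the paper's AVX-512/BF16-optimized implementation is not reproduced here.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """1D dilated convolution, 'valid' padding.

    x: (C_in, L) input,  w: (C_out, C_in, K) kernel.
    Output length is L - (K - 1) * dilation.
    """
    c_out, c_in, k = w.shape
    out_len = x.shape[1] - (k - 1) * dilation
    y = np.zeros((c_out, out_len), dtype=x.dtype)
    for t in range(k):                                            # one kernel tap at a time
        x_slice = x[:, t * dilation : t * dilation + out_len]     # (C_in, out_len)
        y += np.einsum('oi,il->ol', w[:, :, t], x_slice)
    return y

x = np.random.randn(3, 32).astype(np.float32)
w = np.random.randn(4, 3, 5).astype(np.float32)
print(dilated_conv1d(x, w, dilation=2).shape)   # (4, 32 - 4*2) = (4, 24)
```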
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
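The details below are an assumption rather than the paper's exact algorithm: a dropout-like scheme can stay GEMM-friendly by dropping a contiguous slice of units instead of a random scatter of them, so the surviving units form a smaller dense block. A minimal sketch for one linear layer:

```python
import numpy as np

def sliceout_linear(x, W, b, keep_frac=0.75, rng=None, training=True):
    """Linear layer that computes only a contiguous slice of output units in training.

    At train time a smaller dense GEMM covers the kept slice (assumed variant of the
    SliceOut idea); at test time the full layer runs, analogous to turning dropout off.
    """
    if not training:
        return x @ W + b
    rng = rng or np.random.default_rng()
    n = W.shape[1]
    keep = int(n * keep_frac)
    start = rng.integers(0, n - keep + 1)          # random contiguous slice of units
    y = np.zeros((x.shape[0], n), dtype=x.dtype)
    # Compute only the kept slice; rescale like inverted dropout.
    y[:, start:start + keep] = (x @ W[:, start:start + keep]
                                + b[start:start + keep]) / keep_frac
    return y

x = np.random.randn(8, 16).astype(np.float32)
W = np.random.randn(16, 32).astype(np.float32)
b = np.zeros(32, dtype=np.float32)
print(sliceout_linear(x, W, b).shape)   # (8, 32)
```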
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
- Sparse GPU Kernels for Deep Learning [24.94153856081836]
Deep learning applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernels to outperform their dense counterparts.
We develop high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks.
Using our kernels, we demonstrate sparse Transformer and MobileNet models that achieve 1.2-2.1x speedups and up to 12.8x memory savings without sacrificing accuracy.
arXiv Detail & Related papers (2020-06-18T23:59:11Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)