Sparse GPU Kernels for Deep Learning
- URL: http://arxiv.org/abs/2006.10901v2
- Date: Mon, 31 Aug 2020 18:35:07 GMT
- Title: Sparse GPU Kernels for Deep Learning
- Authors: Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen
- Abstract summary: Deep learning applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernels to outperform their dense counterparts.
We develop high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks.
Using our kernels, we demonstrate sparse Transformer and MobileNet models that achieve 1.2-2.1x speedups and up to 12.8x memory savings without sacrificing accuracy.
- Score: 24.94153856081836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific workloads have traditionally exploited high levels of sparsity to
accelerate computation and reduce memory requirements. While deep neural
networks can be made sparse, achieving practical speedups on GPUs is difficult
because these applications have relatively moderate levels of sparsity that are
not sufficient for existing sparse kernels to outperform their dense
counterparts. In this work, we study sparse matrices from deep learning
applications and identify favorable properties that can be exploited to
accelerate computation. Based on these insights, we develop high-performance
GPU kernels for two sparse matrix operations widely applicable in neural
networks: sparse matrix-dense matrix multiplication and sampled dense-dense
matrix multiplication. Our kernels reach 27% of single-precision peak on Nvidia
V100 GPUs. Using our kernels, we demonstrate sparse Transformer and MobileNet
models that achieve 1.2-2.1x speedups and up to 12.8x memory savings without
sacrificing accuracy.
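The abstract names two operations without defining them; the NumPy/SciPy sketch below is only a semantic reference for sparse matrix-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM). All sizes, densities, and variable names are illustrative assumptions; the paper's actual contribution is hand-tuned CUDA kernels for these operations, which this sketch does not attempt to reproduce.
```python
# Minimal semantic sketch of the two operations the paper accelerates
# (SpMM and SDDMM). Sizes, densities, and names are illustrative only;
# the paper's contribution is fast CUDA kernels, not this reference code.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
m, k, n = 64, 128, 32  # assumed problem sizes

# SpMM: C = A_sparse @ B_dense, with a moderately sparse A (as in pruned layers).
A = sp.random(m, k, density=0.3, format="csr", random_state=0)
B = rng.standard_normal((k, n))
C = A @ B  # dense (m, n) result

# SDDMM: evaluate the dense-dense product X @ Y only at the nonzero positions
# of a sparse sampling matrix S (some formulations also scale each value by
# the corresponding stored entry of S).
X = rng.standard_normal((m, k))
Y = rng.standard_normal((k, n))
S = sp.random(m, n, density=0.3, format="csr", random_state=1)
rows, cols = S.nonzero()
vals = np.einsum("pj,pj->p", X[rows], Y[:, cols].T)  # one dot product per nonzero
D = sp.csr_matrix((vals, (rows, cols)), shape=(m, n))

assert np.allclose(D.toarray()[rows, cols], (X @ Y)[rows, cols])
```
In sparse attention, for instance, an SDDMM-style product typically produces the scores restricted to the attention sparsity pattern and an SpMM then applies them to the values, which is how kernels like these map onto sparse Transformer layers.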
Related papers
- TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs [20.4238781638402]
Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems.
Existing GPU libraries offer two dataflow types for sparse convolution.
We introduce TorchSparse++, a new GPU library that achieves the best of both worlds.
arXiv Detail & Related papers (2023-10-25T21:02:38Z)
- PopSparse: Accelerated block sparse matrix multiplication on IPU [0.5661403709207713]
We introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs.
We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run.
Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels.
arXiv Detail & Related papers (2023-03-29T20:00:19Z)
- RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time-consuming because sparse graph-based operations are hard to accelerate with hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
arXiv Detail & Related papers (2022-10-19T17:25:33Z)
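The RSC summary above mentions sampling-based approximation only in passing. Purely to illustrate the precision-for-speed trade-off it refers to, the sketch below approximates a dense matrix product by importance-sampling a subset of the inner-dimension terms; this generic estimator and every name in it are assumptions made for exposition, not RSC's actual algorithm.
```python
# Illustrative sketch of trading precision for speed via sampling:
# approximate A @ B using only a subset of the k inner-dimension terms.
# This is a generic column/row-sampling estimator, NOT the RSC algorithm.
import numpy as np

def sampled_matmul(A, B, num_samples, rng):
    """Unbiased approximation of A @ B from `num_samples` of the k terms."""
    k = A.shape[1]
    # Importance-sample inner indices proportionally to column/row norms.
    probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = probs / probs.sum()
    idx = rng.choice(k, size=num_samples, p=probs)
    # Rescale each sampled outer product so the estimator stays unbiased.
    scale = 1.0 / (num_samples * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 512))
B = rng.standard_normal((512, 64))
exact = A @ B
approx = sampled_matmul(A, B, num_samples=128, rng=rng)  # ~4x fewer terms
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative error with 128/512 sampled terms: {rel_err:.3f}")
```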
- Efficient Quantized Sparse Matrix Operations on Tensor Cores [21.963041375857117]
We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor Cores.
We show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with comparable accuracy for end-to-end Transformer inference.
arXiv Detail & Related papers (2022-09-14T23:52:13Z)
- AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning [1.5301777464637454]
AxoNN is a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU.
By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by a factor of four.
arXiv Detail & Related papers (2021-10-25T14:43:36Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average 3712x speedup with 1301.25x energy reduction on CPU, and 35.4x speedup with 17.66x energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
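The 1D convolution entry above concerns a tuned x86 implementation; as background only, here is a minimal NumPy definition of what a dilated 1D convolution (cross-correlation convention, "valid" padding, single channel) computes. The dilation rate and toy data are arbitrary assumptions, not details from that paper.
```python
# Minimal reference for a dilated 1D convolution ("valid" padding, single
# channel, cross-correlation convention). This only defines the operation;
# it says nothing about the AVX-512-optimized implementation in the paper.
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """y[i] = sum_k w[k] * x[i + k * dilation], for every i where that fits."""
    k = len(w)
    span = (k - 1) * dilation + 1          # receptive field of one output
    out_len = len(x) - span + 1
    return np.array([
        np.dot(w, x[i:i + span:dilation]) for i in range(out_len)
    ])

x = np.arange(10, dtype=float)             # toy input signal
w = np.array([1.0, 0.0, -1.0])             # toy filter taps
print(dilated_conv1d(x, w, dilation=2))    # each output sees x[i], x[i+2], x[i+4]
```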
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models in resource-constrained environments.
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
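For readers unfamiliar with the N:M pattern mentioned above, the sketch below prunes a weight matrix to 2:4 sparsity by magnitude, keeping the two largest-magnitude weights in every contiguous group of four. The 2:4 choice and the one-shot magnitude criterion are assumed here for illustration; they are not necessarily the training procedure studied in that paper.
```python
# Illustration of N:M fine-grained structured sparsity (here 2:4): in every
# contiguous group of M=4 weights along a row, at most N=2 are nonzero.
# One-shot magnitude-based pruning is used purely for illustration.
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Zero out all but the n largest-magnitude entries in each group of m."""
    rows, cols = weights.shape
    assert cols % m == 0, "column count must be a multiple of m"
    groups = weights.reshape(rows, cols // m, m)
    # Indices that would sort each group by |w|; keep the top-n per group.
    order = np.argsort(np.abs(groups), axis=-1)
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., -n:], True, axis=-1)
    return (groups * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
W_sparse = prune_n_m(W, n=2, m=4)
# Every group of 4 consecutive weights now keeps 2 nonzeros (50% sparsity).
print((W_sparse.reshape(4, 2, 4) != 0).sum(axis=-1))
```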
- When deep learning models on GPU can be accelerated by taking advantage of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using a direct sparse operation to speed up the computation of the convolution layers.
arXiv Detail & Related papers (2020-11-12T10:13:48Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid approach to building deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)