SMASH: Sparse Matrix Atomic Scratchpad Hashing
- URL: http://arxiv.org/abs/2105.14156v1
- Date: Sat, 29 May 2021 00:22:50 GMT
- Title: SMASH: Sparse Matrix Atomic Scratchpad Hashing
- Authors: Kaustubh Shivdikar
- Abstract summary: In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel achieves up to a 9.4x speedup compared to competing approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse matrices, more specifically SpGEMM kernels, are commonly found in a
wide range of applications, spanning graph-based path-finding to machine
learning algorithms (e.g., neural networks). A particular challenge in
implementing SpGEMM kernels has been the pressure they place on DRAM. One
approach to tackling this problem is to use an inner-product method for the
SpGEMM kernel implementation. While the inner product produces fewer
intermediate results, it can saturate memory bandwidth due to the large number
of redundant fetches of input matrix elements. An outer-product-based SpGEMM
kernel reduces these redundant fetches, but at the cost of extra computation
and memory accesses for producing and managing partial products.
In this thesis, we introduce a novel SpGEMM kernel implementation based on
the row-wise product approach. We leverage atomic instructions to merge
intermediate partial products as they are generated, which eliminates the need
to create partial product matrices (a minimal sketch of this merging scheme
follows the abstract).
To evaluate our row-wise product approach, we map an optimized SpGEMM kernel
to a custom accelerator designed to accelerate graph-based applications. The
targeted accelerator is an experimental system named PIUMA, being developed by
Intel. PIUMA provides several attractive features, including fast context
switching, user-configurable caches, globally addressable memory, non-coherent
caches, and asynchronous pipelines. We tailor our SpGEMM kernel to exploit many
of the features of the PIUMA fabric.
This thesis compares our SpGEMM implementation against prior solutions, all
mapped to the PIUMA framework. We briefly describe some of the PIUMA
architecture features and then delve into the details of our optimized SpGEMM
kernel. Our SpGEMM kernel achieves up to a 9.4x speedup compared to competing
approaches.
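In the row-wise (Gustavson-style) formulation, each nonzero A(i,k) is paired with row k of B, and the partial products A(i,k)*B(k,j) are merged straight into an accumulator for output row i. The C++ sketch below illustrates how atomic instructions can perform that merging in a hashed scratchpad, avoiding both the partial-product matrices of the outer-product method and the repeated input fetches of the inner-product method. It is a minimal illustration under assumed CSR inputs and an open-addressed scratchpad; the data structures, hash function, and function names are assumptions for exposition, not the PIUMA kernel described in the thesis.

```cpp
// Minimal sketch of row-wise SpGEMM with an atomic hash scratchpad.
// The CSR layout, hash function, and names below are illustrative
// assumptions for exposition; this is not the PIUMA kernel itself.
#include <atomic>
#include <cstddef>
#include <vector>

// CSR storage: row_ptr has n_rows + 1 entries; col/val hold the nonzeros.
struct Csr {
    int n_rows = 0, n_cols = 0;
    std::vector<int> row_ptr, col;
    std::vector<double> val;
};

// One open-addressed scratchpad slot: output column (key) + accumulator.
struct Slot {
    std::atomic<int>    key{-1};   // -1 marks an empty slot
    std::atomic<double> acc{0.0};
};

// Merge one partial product (column j, value v) into the scratchpad.
// Atomics let several workers emitting partials for the same output row
// accumulate concurrently, with no per-worker partial-product matrices.
// table.size() must be a power of two large enough to hold every distinct
// output column of the row.
inline void scratchpad_merge(std::vector<Slot>& table, int j, double v) {
    const std::size_t mask = table.size() - 1;
    std::size_t h = (static_cast<std::size_t>(j) * 0x9E3779B9u) & mask;
    for (;;) {
        int expected = -1;
        // Claim an empty slot, or reuse the slot already holding key j.
        if (table[h].key.compare_exchange_strong(expected, j) || expected == j) {
            double old = table[h].acc.load(std::memory_order_relaxed);
            while (!table[h].acc.compare_exchange_weak(old, old + v)) {
                // CAS loop: portable atomic add for doubles (pre-C++20).
            }
            return;
        }
        h = (h + 1) & mask;  // linear probing
    }
}

// Row-wise product: for each nonzero A(i,k), stream row k of B and merge
// A(i,k) * B(k,j) into row i's scratchpad as soon as it is produced.
void spgemm_row(const Csr& A, const Csr& B, int i, std::vector<Slot>& table) {
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
        const int    k    = A.col[p];
        const double a_ik = A.val[p];
        for (int q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q)
            scratchpad_merge(table, B.col[q], a_ik * B.val[q]);
    }
    // Occupied slots are then compacted into row i of the CSR output
    // (omitted); no intermediate partial-product matrix is ever built.
}
```

A caller would allocate one power-of-two-sized scratchpad per in-flight output row (e.g., std::vector<Slot> table(256);) and compact the occupied slots into the output CSR row afterwards; on a fabric such as PIUMA, the same merge step would presumably map onto the hardware's native atomic instructions rather than std::atomic.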
Related papers
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU [0.0]
We focus on three matrix operations relevant for machine learning applications.
We develop optimized implementations for SPMM, SDDMM, and FusedMM operations utilizing Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API.
arXiv Detail & Related papers (2023-11-01T08:43:59Z)
- Accelerating Machine Learning Primitives on Commodity Hardware [0.0]
We present an extensive study of the Sliding Window convolution technique as a more efficient alternative to the commonly used General Matrix Multiplication (GEMM) based convolution in Deep Neural Networks (DNNs).
Our results suggest that the Sliding Window computation kernels can outperform GEMM-based convolution on a CPU and even on dedicated hardware accelerators.
This could promote a wider adoption of AI on low-power and low-memory devices without the need for specialized hardware.
arXiv Detail & Related papers (2023-10-08T16:26:18Z)
- Spectrum-guided Multi-granularity Referring Video Object Segmentation [56.95836951559529]
Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features.
This causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation.
We propose a Spectrum-guided Multi-granularity approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks.
arXiv Detail & Related papers (2023-07-25T14:35:25Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- Distributed-Memory Sparse Kernels for Machine Learning [1.5050487967966784]
We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM.
We give two communication-eliding strategies to reduce costs further for FusedMM kernels.
We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdos-Renyi random matrices and large real-world sparse matrices.
arXiv Detail & Related papers (2022-03-15T06:34:39Z)
- SKIing on Simplices: Kernel Interpolation on the Permutohedral Lattice for Scalable Gaussian Processes [39.821400341226315]
The Structured Kernel Interpolation (SKI) framework is used to perform efficient matrix-vector multiplies (MVMs) on a grid.
We develop a connection between SKI and the permutohedral lattice used for high-dimensional fast bilateral filtering.
Using a sparse simplicial grid instead of a dense rectangular one, we can perform GP inference exponentially faster in the dimension than SKI.
We additionally provide an implementation of Simplex-GP, which enables significant acceleration of MVM-based inference.
arXiv Detail & Related papers (2021-06-12T06:04:56Z)
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks [3.577310844634503]
We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM.
By using user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches.
arXiv Detail & Related papers (2020-11-07T18:06:57Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.