FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
- URL: http://arxiv.org/abs/2011.06391v2
- Date: Wed, 27 Oct 2021 01:35:27 GMT
- Title: FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
- Authors: Md. Khaledur Rahman, Majedul Haque Sujon and Ariful Azad
- Abstract summary: We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM.
By using user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches.
- Score: 3.577310844634503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop a fused matrix multiplication kernel that unifies sampled
dense-dense matrix multiplication and sparse-dense matrix multiplication under
a single operation called FusedMM. By using user-defined functions, FusedMM can
capture almost all computational patterns needed by popular graph embedding and
GNN approaches. FusedMM is an order of magnitude faster than its equivalent
kernels in Deep Graph Library. The superior performance of FusedMM comes from
low-level vectorized kernels, a suitable load-balancing scheme, and efficient
utilization of memory bandwidth. FusedMM can tune its performance
using a code generator and perform equally well on Intel, AMD and ARM
processors. FusedMM speeds up an end-to-end graph embedding algorithm by up to
28x on different processors.
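To make the fused operation concrete, here is a minimal single-threaded Python sketch of the pattern the abstract describes. It is not the authors' vectorized C kernel; the function and parameter names (`fused_sddmm_spmm`, `edge_fn`) are illustrative.

```python
# Sketch of the fused SDDMM-SpMM pattern: for each edge (i, j), a
# user-defined function of x_i and x_j produces an edge value (the SDDMM
# step), which immediately scales x_j's contribution to output row i (the
# SpMM step). The intermediate sparse matrix is never materialized.
import numpy as np
import scipy.sparse as sp

def fused_sddmm_spmm(A, X, edge_fn):
    """A: graph in CSR form; X: n-by-d dense embeddings;
    edge_fn: user-defined scalar function of a vertex pair."""
    n, _ = X.shape
    Y = np.zeros_like(X)
    for i in range(n):                                 # rows of the sparse matrix
        for p in range(A.indptr[i], A.indptr[i + 1]):  # neighbors of vertex i
            j = A.indices[p]
            s = edge_fn(X[i], X[j])  # SDDMM: sampled dense-dense product
            Y[i] += s * X[j]         # SpMM: aggregate without storing s
    return Y

A = sp.random(100, 100, density=0.05, format="csr")
X = np.random.rand(100, 16)
Y = fused_sddmm_spmm(A, X, lambda xi, xj: xi @ xj)  # dot-product edge function
```

Because each edge value is consumed as soon as it is produced, the sparse intermediate that a separate SDDMM-then-SpMM pipeline would write out and read back never touches memory, which is where the bandwidth savings come from.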
Related papers
- SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference on CPU-based architectures.
Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
arXiv Detail & Related papers (2024-11-23T21:43:38Z)
- Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU [0.0]
We focus on three matrix operations relevant for machine learning applications.
We develop optimized implementations for SpMM, SDDMM, and FusedMM operations utilizing Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API (the unfused SDDMM/SpMM baseline is sketched after this entry).
arXiv Detail & Related papers (2023-11-01T08:43:59Z)
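For contrast with the fused kernel above, here is a sketch of the unfused two-kernel baseline that papers like the one above optimize: SDDMM evaluates dense dot products only at the nonzeros of a sparse pattern, and a separate SpMM then consumes the resulting sparse matrix. This is plain NumPy/SciPy, not the ESIMD SYCL implementation, and `sddmm` is an illustrative name.

```python
# Unfused baseline: SDDMM materializes a sparse matrix of edge values,
# which a second SpMM pass then multiplies by a dense matrix.
import numpy as np
import scipy.sparse as sp

def sddmm(S, X, Y):
    """(X @ Y.T) evaluated only at the nonzeros of sparse S, scaled by S."""
    C = S.tocoo()
    vals = np.einsum("ij,ij->i", X[C.row], Y[C.col]) * C.data  # per-edge dots
    return sp.csr_matrix((vals, (C.row, C.col)), shape=S.shape)

S = sp.random(100, 100, density=0.05, format="csr")
X = np.random.rand(100, 16)
H = sddmm(S, X, X)  # SDDMM: sparse matrix of edge scores
Z = H @ X           # SpMM: aggregate neighbor features
```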
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy across a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Over-the-Air Split Machine Learning in Wireless MIMO Networks [56.27831295707334]
In split machine learning (ML), different partitions of a neural network (NN) are executed by different computing nodes.
To ease the communication burden, over-the-air computation (OAC) can efficiently implement all or part of the computation at the same time as the communication.
arXiv Detail & Related papers (2022-10-07T15:39:11Z)
- Distributed-Memory Sparse Kernels for Machine Learning [1.5050487967966784]
We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM.
We give two communication-eliding strategies to reduce costs further for FusedMM kernels.
We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices.
arXiv Detail & Related papers (2022-03-15T06:34:39Z)
- Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show a significant speedup of 32x to 76x for the new GPU implementation over the CPU-based NMFk (the underlying multiplicative-update iteration is sketched after this entry).
arXiv Detail & Related papers (2022-02-19T03:49:21Z)
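As background for the NMF entry above, here is a sketch of the classic multiplicative-update iteration (Lee and Seung) that NMF implementations typically build on. This small in-memory version is illustrative only and does not reproduce the paper's out-of-memory or GPU machinery.

```python
# Multiplicative-update NMF: factor nonnegative V (m x n) into W @ H with
# W (m x k) and H (k x n); the updates keep both factors nonnegative.
import numpy as np

def nmf(V, k, iters=200, eps=1e-9):
    rng = np.random.default_rng(0)
    m, n = V.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W
    return W, H

V = np.random.rand(60, 40)
W, H = nmf(V, k=8)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```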
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds (a simplified hash-encoding sketch follows this entry).
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
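A simplified sketch of the multiresolution hash encoding idea from the entry above: each level hashes integer grid coordinates into its own table of trainable feature vectors, and the per-level features are concatenated. Real implementations also interpolate between grid corners and train the tables by backpropagation; both are omitted here, and all names are illustrative.

```python
# Multiresolution hash encoding (simplified): hash each level's integer cell
# coordinates into that level's feature table, then concatenate across levels.
import numpy as np

PRIMES = np.array([1, 2654435761], dtype=np.uint64)  # 2D spatial-hash primes

def hash_encode(xy, tables, base_res=16, growth=2.0):
    """xy: (n, 2) points in [0, 1)^2; tables: one (T, F) array per level."""
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth**level)               # grid resolution
        cell = np.floor(xy * res).astype(np.uint64)       # integer cell coords
        h = np.bitwise_xor.reduce(cell * PRIMES, axis=1)  # XOR of per-axis hashes
        feats.append(table[h % np.uint64(len(table))])
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
tables = [rng.normal(size=(2**14, 2)) for _ in range(4)]  # 4 levels, 2 features each
print(hash_encode(rng.random((5, 2)), tables).shape)      # (5, 8)
```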
- SMASH: Sparse Matrix Atomic Scratchpad Hashing [0.0]
In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel achieves a 9.4x speedup compared to competing approaches (see the sketch after this entry).
arXiv Detail & Related papers (2021-05-29T00:22:50Z)
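A single-threaded Python analogue of the row-wise (Gustavson-style) SpGEMM pattern the SMASH entry describes, with an ordinary dictionary standing in for the atomic scratchpad hashing; `rowwise_spgemm` is an illustrative name, not the thesis code.

```python
# Row-wise SpGEMM: for each row i of A, merge the partial products
# A[i, k] * B[k, :] into a per-row hash map keyed by output column.
import numpy as np
import scipy.sparse as sp

def rowwise_spgemm(A, B):
    rows, cols, vals = [], [], []
    for i in range(A.shape[0]):
        scratch = {}  # sequential stand-in for the atomic scratchpad
        for p in range(A.indptr[i], A.indptr[i + 1]):
            k, a = A.indices[p], A.data[p]
            for q in range(B.indptr[k], B.indptr[k + 1]):
                j = B.indices[q]
                scratch[j] = scratch.get(j, 0.0) + a * B.data[q]
        for j, v in scratch.items():
            rows.append(i); cols.append(j); vals.append(v)
    return sp.csr_matrix((vals, (rows, cols)), shape=(A.shape[0], B.shape[1]))

A = sp.random(50, 40, density=0.1, format="csr")
B = sp.random(40, 30, density=0.1, format="csr")
assert np.allclose(rowwise_spgemm(A, B).toarray(), (A @ B).toarray())
```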
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average a 3712x speedup with a 1301.25x energy reduction over CPU, and a 35.4x speedup with a 17.66x energy reduction over GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- Image Modeling with Deep Convolutional Gaussian Mixture Models [79.0660895390689]
We present a new formulation of deep hierarchical Gaussian Mixture Models (GMMs) that is suitable for describing and generating images.
DCGMMs avoid the limitations of flat GMMs through a stacked architecture of multiple GMM layers, linked by convolution and pooling operations.
For generating sharp images with DCGMMs, we introduce a new gradient-based technique for sampling through non-invertible operations like convolution and pooling.
Based on the MNIST and FashionMNIST datasets, we validate the DCGMM model by demonstrating its superiority over flat GMMs for clustering, sampling and outlier detection.
arXiv Detail & Related papers (2021-04-19T12:08:53Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model (a consensus-ADMM sketch follows this entry).
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
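A sketch of plain global-consensus ADMM for a decentralized least-squares problem, the standard scheme (in the form given by Boyd et al.) that the mini-batch, coded variant above builds on; the coding and edge-computing aspects are not reproduced, and all names are illustrative.

```python
# Global-consensus ADMM: each agent i holds (A_i, b_i) and locally minimizes
# 0.5 * ||A_i x - b_i||^2; agents agree on x through the consensus variable z.
import numpy as np

def consensus_admm(As, bs, rho=1.0, iters=100):
    d = As[0].shape[1]
    z = np.zeros(d)
    xs = [np.zeros(d) for _ in As]  # local primal variables
    us = [np.zeros(d) for _ in As]  # scaled dual variables
    for _ in range(iters):
        for i, (A, b) in enumerate(zip(As, bs)):  # local solve at each node
            xs[i] = np.linalg.solve(A.T @ A + rho * np.eye(d),
                                    A.T @ b + rho * (z - us[i]))
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)  # consensus step
        for i in range(len(As)):
            us[i] += xs[i] - z                                # dual update
    return z

rng = np.random.default_rng(0)
x_true = rng.normal(size=5)
As = [rng.normal(size=(20, 5)) for _ in range(4)]
bs = [A @ x_true for A in As]
print(np.round(consensus_admm(As, bs) - x_true, 4))  # near-zero error
```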
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.