Distributed-Memory Sparse Kernels for Machine Learning
- URL: http://arxiv.org/abs/2203.07673v1
- Date: Tue, 15 Mar 2022 06:34:39 GMT
- Title: Distributed-Memory Sparse Kernels for Machine Learning
- Authors: Vivek Bharadwaj, Aydın Buluç, James Demmel
- Abstract summary: We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM.
We give two communication-eliding strategies to reduce costs further for FusedMM kernels.
We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices.
- Score: 1.5050487967966784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times
Dense Matrix Multiplication (SpMM) appear in diverse settings, such as
collaborative filtering, document clustering, and graph embedding. Frequently,
the SDDMM output becomes the input sparse matrix for a subsequent SpMM
operation. Existing work has focused on shared memory parallelization of these
primitives. While there has been extensive analysis of communication-minimizing
distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the
back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that
distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to
algorithms for SDDMM with identical communication costs and input/output data
layouts. Further, we give two communication-eliding strategies to reduce costs
further for FusedMM kernels: either reusing the replication of an input dense
matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM
kernels.
We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using
Erdős-Rényi random matrices and large real-world sparse matrices. On 256 nodes
with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding
approach can save at least 30% of the time spent exclusively in communication
compared to executing a distributed-memory SpMM and SDDMM kernel in sequence.
On real-world matrices with hundreds of millions of edges, all of our
algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On
these matrices, our communication-eliding techniques exhibit runtimes up to 1.6
times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test
the scaling of our algorithms in real-world applications, including
collaborative filtering via alternating least squares and inference for
attention-based graph neural networks.
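To make the kernels concrete, here is a minimal shared-memory sketch of the SDDMM, SpMM, and fused SDDMM+SpMM semantics described in the abstract. It is illustrative only (not the paper's distributed 1.5D/2.5D implementation), assumes NumPy/SciPy, and the helper names `sddmm`, `spmm`, and `fusedmm` are hypothetical.

```python
# Illustrative sketch of SDDMM, SpMM, and a fused SDDMM+SpMM pass over a CSR
# matrix. Assumes NumPy/SciPy; not the paper's distributed-memory code.
import numpy as np
import scipy.sparse as sp

def sddmm(S, A, B):
    """SDDMM: M = S .* (A @ B.T), evaluated only at the nonzeros of S."""
    C = S.tocoo()
    # Row-wise dot products A[i, :] . B[j, :] for every stored entry (i, j).
    vals = C.data * np.einsum('ij,ij->i', A[C.row], B[C.col])
    return sp.csr_matrix((vals, (C.row, C.col)), shape=S.shape)

def spmm(M, B):
    """SpMM: dense result of sparse M times dense B."""
    return M @ B

def fusedmm(S, A, B):
    """Fused SDDMM + SpMM: (S .* (A @ B.T)) @ B in one sweep over S,
    without materializing the intermediate sparse matrix."""
    out = np.zeros_like(A)
    indptr, indices, data = S.indptr, S.indices, S.data
    for i in range(S.shape[0]):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            out[i] += (data[k] * A[i].dot(B[j])) * B[j]
    return out

# Sanity check: the fused and unfused paths agree on random inputs.
rng = np.random.default_rng(0)
S = sp.random(64, 48, density=0.05, format='csr', random_state=0)
A = rng.standard_normal((64, 8))
B = rng.standard_normal((48, 8))
assert np.allclose(fusedmm(S, A, B), spmm(sddmm(S, A, B), B))
```

The fused loop touches each stored entry of the sparse matrix once and never materializes the intermediate SDDMM result. In the distributed setting, the abstract's two eliding strategies correspond to reusing the replicated input dense matrix across both kernels, or fusing the local kernels roughly as sketched above.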
Related papers
- SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference for CPU-based architectures.
Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
arXiv Detail & Related papers (2024-11-23T21:43:38Z)
- Deep Gaussian mixture model for unsupervised image segmentation [1.3654846342364308]
In many tasks sufficient pixel-level labels are very difficult to obtain.
We propose a method which combines a Gaussian mixture model (GMM) with unsupervised deep learning techniques.
We demonstrate the advantages of our method in various experiments, using infarct segmentation on multi-sequence MRI images as an example.
arXiv Detail & Related papers (2024-04-18T15:20:59Z)
- Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU [0.0]
We focus on three matrix operations relevant for machine learning applications.
We develop optimized implementations for SpMM, SDDMM, and FusedMM operations utilizing Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API.
arXiv Detail & Related papers (2023-11-01T08:43:59Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute the covariance matrices involved.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Over-the-Air Split Machine Learning in Wireless MIMO Networks [56.27831295707334]
In split machine learning (ML), different partitions of a neural network (NN) are executed by different computing nodes.
To ease the communication burden, over-the-air computation (OAC) can efficiently implement all or part of the computation at the same time as the communication.
arXiv Detail & Related papers (2022-10-07T15:39:11Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- TaSPM: Targeted Sequential Pattern Mining [53.234101208024335]
We propose a generic framework, TaSPM, based on the fast CM-SPAM algorithm.
We also propose several pruning strategies to reduce meaningless operations in mining processes.
Experiments show that the novel targeted mining algorithm TaSPM achieves faster running times and lower memory consumption.
arXiv Detail & Related papers (2022-02-26T17:49:47Z)
- SMASH: Sparse Matrix Atomic Scratchpad Hashing [0.0]
In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel achieves a 9.4x speedup compared to competing approaches.
arXiv Detail & Related papers (2021-05-29T00:22:50Z)
- Image Modeling with Deep Convolutional Gaussian Mixture Models [79.0660895390689]
We present a new formulation of deep hierarchical Gaussian Mixture Models (GMMs) that is suitable for describing and generating images.
Deep Convolutional GMMs (DCGMMs) use a stacked architecture of multiple GMM layers, linked by convolution and pooling operations.
For generating sharp images with DCGMMs, we introduce a new gradient-based technique for sampling through non-invertible operations like convolution and pooling.
Based on the MNIST and FashionMNIST datasets, we validate DCGMMs by demonstrating their superiority over flat GMMs for clustering, sampling and outlier detection.
arXiv Detail & Related papers (2021-04-19T12:08:53Z)
- FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks [3.577310844634503]
We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM.
By using user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches.
arXiv Detail & Related papers (2020-11-07T18:06:57Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, is often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.