Distributed-Memory Sparse Kernels for Machine Learning
- URL: http://arxiv.org/abs/2203.07673v1
- Date: Tue, 15 Mar 2022 06:34:39 GMT
- Title: Distributed-Memory Sparse Kernels for Machine Learning
- Authors: Vivek Bharadwaj, Aydin Buluç, James Demmel
- Abstract summary: We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM.
We give two communication-eliding strategies to reduce costs further for FusedMM kernels.
We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices.
- Score: 1.5050487967966784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times
Dense Matrix Multiplication (SpMM) appear in diverse settings, such as
collaborative filtering, document clustering, and graph embedding. Frequently,
the SDDMM output becomes the input sparse matrix for a subsequent SpMM
operation. Existing work has focused on shared memory parallelization of these
primitives. While there has been extensive analysis of communication-minimizing
distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the
back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that
distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to
algorithms for SDDMM with identical communication costs and input/output data
layouts. Further, we give two communication-eliding strategies to reduce costs
further for FusedMM kernels: either reusing the replication of an input dense
matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM
kernels.
We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using
Erdős-Rényi random matrices and large real-world sparse matrices. On 256 nodes
with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding
approach can save at least 30% of the time spent exclusively in communication
compared to executing a distributed-memory SpMM and SDDMM kernel in sequence.
On real-world matrices with hundreds of millions of edges, all of our
algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On
these matrices, our communication-eliding techniques exhibit runtimes up to 1.6
times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test
the scaling of our algorithms in real-world applications, including
collaborative filtering via alternating least squares and inference for
attention-based graph neural networks.
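In a shared-memory setting, the two kernels and their fusion are compact to state. The sketch below is a minimal NumPy/SciPy reference under my own naming, not the paper's distributed implementation: SDDMM scales each nonzero of the sparse matrix by a dot product of rows of the two dense factors, the resulting sparse matrix then feeds an SpMM, and the fused variant accumulates the SpMM output in a single pass over the nonzeros, in the spirit of the local-kernel-fusion strategy described above. The 1.5D/2.5D communication schedules and the replication-reuse optimization are not shown.

```python
import numpy as np
import scipy.sparse as sp

def sddmm(S, X, Y):
    # SDDMM: out_ij = S_ij * <X_i, Y_j> for every nonzero (i, j) of S.
    S = S.tocoo()
    vals = S.data * np.einsum("ij,ij->i", X[S.row], Y[S.col])
    return sp.coo_matrix((vals, (S.row, S.col)), shape=S.shape).tocsr()

def fusedmm(S, X, Y):
    # Fused SDDMM + SpMM: Z = sddmm(S, X, Y) @ Y, accumulated in one pass
    # over the nonzeros, without materializing the intermediate sparse matrix.
    S = S.tocoo()
    Z = np.zeros((S.shape[0], Y.shape[1]))
    for v, i, j in zip(S.data, S.row, S.col):
        Z[i] += (v * (X[i] @ Y[j])) * Y[j]
    return Z

# Toy check: the unfused SDDMM -> SpMM sequence matches the fused kernel.
rng = np.random.default_rng(0)
S = sp.random(50, 50, density=0.05, format="csr", random_state=rng)
X = rng.standard_normal((50, 8))
Y = rng.standard_normal((50, 8))
Z_unfused = sddmm(S, X, Y) @ Y  # SDDMM output feeds the subsequent SpMM
assert np.allclose(Z_unfused, fusedmm(S, X, Y))
```

In the distributed setting, the paper's communication-eliding strategies act one level up: either the replicated dense operand is reused across both kernels in sequence, or each processor calls a fused local kernel like the loop above on its block.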
Related papers
- Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters [5.429282997550318]
Our proposed algorithm significantly reduces runtime complexity per iteration from $\mathcal{O}(NCD^2)$ to a complexity scaling linearly with $D$ and remaining constant w.r.t.
As a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, and observe training times of approximately nine hours on a single state-of-the-art CPU.
arXiv Detail & Related papers (2025-01-21T17:11:25Z) - FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores [6.404201720333765]
We propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture.
Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy.
We show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe).
arXiv Detail & Related papers (2024-12-15T01:12:33Z) - Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel
Max Series GPU [0.0]
We focus on three matrix operations relevant for machine learning applications.
We develop optimized implementations for SpMM, SDDMM, and FusedMM operations utilizing Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API.
arXiv Detail & Related papers (2023-11-01T08:43:59Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Over-the-Air Split Machine Learning in Wireless MIMO Networks [56.27831295707334]
In split machine learning (ML), different partitions of a neural network (NN) are executed by different computing nodes.
To ease communication burden, over-the-air computation (OAC) can efficiently implement all or part of the computation at the same time of communication.
arXiv Detail & Related papers (2022-10-07T15:39:11Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - TaSPM: Targeted Sequential Pattern Mining [53.234101208024335]
We propose a generic framework, TaSPM, based on the fast CM-SPAM algorithm.
We also propose several pruning strategies to reduce meaningless operations in mining processes.
Experiments show that the novel targeted mining algorithm TaSPM achieves faster running times and lower memory consumption.
arXiv Detail & Related papers (2022-02-26T17:49:47Z) - SMASH: Sparse Matrix Atomic Scratchpad Hashing [0.0]
In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach.
We leverage atomic instructions to merge intermediate partial products as they are generated.
Our kernel achieves a 9.4x speedup compared to competing approaches.
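As context for the row-wise product approach named above, a serial Gustavson-style SpGEMM can be sketched as follows; the per-row dictionary is only a stand-in for the atomic scratchpad hashing the thesis performs in hardware, so this is an illustrative sketch rather than the SMASH kernel itself.

```python
import scipy.sparse as sp

def rowwise_spgemm(A, B):
    # Row-wise (Gustavson) SpGEMM: C[i, :] = sum_k A[i, k] * B[k, :].
    # A per-row hash map merges partial products as they are generated;
    # SMASH performs this merge with atomic scratchpad hashing instead.
    A, B = A.tocsr(), B.tocsr()
    rows, cols, vals = [], [], []
    for i in range(A.shape[0]):
        acc = {}
        for idx in range(A.indptr[i], A.indptr[i + 1]):
            k, a_ik = A.indices[idx], A.data[idx]
            for jdx in range(B.indptr[k], B.indptr[k + 1]):
                j = B.indices[jdx]
                acc[j] = acc.get(j, 0.0) + a_ik * B.data[jdx]
        rows.extend([i] * len(acc))
        cols.extend(acc.keys())
        vals.extend(acc.values())
    return sp.coo_matrix((vals, (rows, cols)), shape=(A.shape[0], B.shape[1])).tocsr()
```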
arXiv Detail & Related papers (2021-05-29T00:22:50Z) - Image Modeling with Deep Convolutional Gaussian Mixture Models [79.0660895390689]
We present a new formulation of deep hierarchical Gaussian Mixture Models (GMMs) that is suitable for describing and generating images.
DCGMMs avoid the limitations of flat GMMs through a stacked architecture of multiple GMM layers, linked by convolution and pooling operations.
For generating sharp images with DCGMMs, we introduce a new gradient-based technique for sampling through non-invertible operations like convolution and pooling.
Based on the MNIST and FashionMNIST datasets, we validate the DCGMM model by demonstrating its superiority over flat GMMs for clustering, sampling and outlier detection.
arXiv Detail & Related papers (2021-04-19T12:08:53Z) - FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph
Neural Networks [3.577310844634503]
We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM.
By using user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches.
arXiv Detail & Related papers (2020-11-07T18:06:57Z) - Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge
Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)