Related papers: Minuet: Accelerating 3D Sparse Convolutions on GPUs

Minuet: Accelerating 3D Sparse Convolutions on GPUs

URL: http://arxiv.org/abs/2401.06145v1
Date: Fri, 1 Dec 2023 05:09:02 GMT
Title: Minuet: Accelerating 3D Sparse Convolutions on GPUs
Authors: Jiacheng Yang, Christina Giannoula, Jun Wu, Mostafa Elhoushi, James Gleeson, Gennady Pekhimenko
Abstract summary: Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. In this work, we analyze the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs. Our evaluations show that Minuet significantly outperforms prior SC engines by on average $1.74times$ (up to $2.22times$) for end-to-end point cloud network executions.
Score: 9.54287796030519
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. Different from dense convolution, SC preserves the sparsity of the input point cloud by only allowing outputs to specific locations. To efficiently compute SC, prior SC engines first use hash tables to build a kernel map that stores the necessary General Matrix Multiplication (GEMM) operations to be executed (Map step), and then use a Gather-GEMM-Scatter process to execute these GEMM operations (GMaS step). In this work, we analyze the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs. Minuet proposes to (i) replace the hash tables used in the Map step with a novel segmented sorting double-traversed binary search algorithm that highly utilizes the on-chip memory hierarchy of GPUs, (ii) use a lightweight scheme to autotune the tile size in the Gather and Scatter operations of the GMaS step, such that to adapt the execution to the particular characteristics of each SC layer, dataset, and GPU architecture, and (iii) employ a padding-efficient GEMM grouping approach that reduces both memory padding and kernel launching overheads. Our evaluations show that Minuet significantly outperforms prior SC engines by on average $1.74\times$ (up to $2.22\times$) for end-to-end point cloud network executions. Our novel segmented sorting double-traversed binary search algorithm achieves superior speedups by $15.8\times$ on average (up to $26.8\times$) over prior SC engines in the Map step. The source code of Minuet is publicly available at https://github.com/UofT-EcoSystem/Minuet.

Related papers

AIRES: Accelerating Out-of-Core GCNs via Algorithm-System Co-Design [6.554916179445241]
Graph convolutional networks (GCNs) are fundamental in various scientific applications, ranging from biomedical protein-protein interactions (PPI) to large-scale recommendation systems.<n>An essential component for modeling graph structures in GCNs is sparse general matrix-matrix multiplication (SpGEMM)<n>SpGEMMs are often conducted in an out-of-core fashion due to limited GPU memory space in resource-constrained systems.<n>We propose AIRES, a novel algorithm-system co-design solution to accelerate out-of-core SpGEMM computation for GCNs.
arXiv Detail & Related papers (2025-07-02T00:35:43Z)
Fused3S: Fast Sparse Attention on Tensor Cores [3.6068301267188]
This paper introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor core utilization and minimizes data movement.<n>Across real-world graph datasets, Fused3S $1.6- 16.3times$ and $1.5-14times$ speedup over state-of-the-art on H100 and A30 GPU.
arXiv Detail & Related papers (2025-05-12T22:09:05Z)
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks. We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems. Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt [65.25603275491544]
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) Our method is 30% faster than the original 3DGS while obtaining the same reconstruction quality optimization.
arXiv Detail & Related papers (2024-09-19T16:31:44Z)
Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For Hidden Markov Models [70.26374282390401]
Decoding the original signal (i.e., hidden chain) from the noisy observations is one of the main goals in nearly all HMM based data analyses. We present Quick Adaptive Ternary (QATS), a divide-and-conquer procedure which decodes the hidden sequence in polylogarithmic computational complexity.
arXiv Detail & Related papers (2023-05-29T19:37:48Z)
Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras. Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation. We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25$ times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
Discrete Morse Sandwich: Fast Computation of Persistence Diagrams for Scalar Data -- An Algorithm and A Benchmark [8.648433479399857]
This paper introduces an efficient algorithm for persistence diagram computation, given an input piecewise linear scalar field f defined on a d-dimensional simplicial complex K. We express this algorithm within the setting of discrete Morse theory, which considerably reduces the number of input simplices to consider. We also introduce a stratification approach to the problem, that we call "sandwiching"
arXiv Detail & Related papers (2022-06-27T10:54:24Z)
Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems. Benchmark results show significant improvement of 32X to 76x speedup with the new implementation using GPU over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z)
GPU-accelerated Faster Mean Shift with euclidean distance metrics [1.3507758562554621]
Mean-shift algorithm is widely used to solve clustering problems. In previous research, we proposed a novel GPU-accelerated Faster Mean-shift algorithm. In this study, we extend and improve the previous algorithm to handle Euclidean distance metrics.
arXiv Detail & Related papers (2021-12-27T20:18:24Z)
ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception [91.24236600199542]
ASH is a modern and high-performance framework for parallel spatial hashing on GPU. ASH achieves higher performance, supports richer functionality, and requires fewer lines of code. ASH and its example applications are open sourced in Open3D.
arXiv Detail & Related papers (2021-10-01T16:25:40Z)
VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose textitVersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator. textitVersaGNN achieves on average 3712$times$ speedup with 1301.25$times$ energy reduction on CPU, and 35.4$times$ speedup with 17.66$times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search [7.596028906226877]
This paper presents a novel Mind-based search foraccel-based search space. We derive differentiable approximations to the otherwise nonsmooth$optimal mapping space. With a differentiable approximation, we can compare high-bound efficient algorithms to find Mind-based search schemes.
arXiv Detail & Related papers (2021-03-02T06:11:58Z)
Faster Mean-shift: GPU-accelerated clustering for cosine embedding-based cell segmentation and tracking [12.60841328582138]
We propose a novel Faster Mean-shift algorithm, which tackles the computational bottleneck of embedding based cell segmentation and tracking. The proposed Faster Mean-shift algorithm achieved 7-10 times speedup compared to the state-of-the-art embedding based cell instance segmentation and tracking algorithm. Our Faster Mean-shift algorithm also achieved the highest computational speed compared to other GPU benchmarks with optimized memory consumption.
arXiv Detail & Related papers (2020-07-28T14:52:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.