Minuet: Accelerating 3D Sparse Convolutions on GPUs
- URL: http://arxiv.org/abs/2401.06145v1
- Date: Fri, 1 Dec 2023 05:09:02 GMT
- Title: Minuet: Accelerating 3D Sparse Convolutions on GPUs
- Authors: Jiacheng Yang, Christina Giannoula, Jun Wu, Mostafa Elhoushi, James
Gleeson, Gennady Pekhimenko
- Abstract summary: Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse.
In this work, we analyze the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs.
Our evaluations show that Minuet significantly outperforms prior SC engines by on average $1.74times$ (up to $2.22times$) for end-to-end point cloud network executions.
- Score: 9.54287796030519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Convolution (SC) is widely used for processing 3D point clouds that
are inherently sparse. Different from dense convolution, SC preserves the
sparsity of the input point cloud by only allowing outputs to specific
locations. To efficiently compute SC, prior SC engines first use hash tables to
build a kernel map that stores the necessary General Matrix Multiplication
(GEMM) operations to be executed (Map step), and then use a Gather-GEMM-Scatter
process to execute these GEMM operations (GMaS step). In this work, we analyze
the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a
novel memory-efficient SC engine tailored for modern GPUs. Minuet proposes to
(i) replace the hash tables used in the Map step with a novel segmented sorting
double-traversed binary search algorithm that highly utilizes the on-chip
memory hierarchy of GPUs, (ii) use a lightweight scheme to autotune the tile
size in the Gather and Scatter operations of the GMaS step, such that to adapt
the execution to the particular characteristics of each SC layer, dataset, and
GPU architecture, and (iii) employ a padding-efficient GEMM grouping approach
that reduces both memory padding and kernel launching overheads. Our
evaluations show that Minuet significantly outperforms prior SC engines by on
average $1.74\times$ (up to $2.22\times$) for end-to-end point cloud network
executions. Our novel segmented sorting double-traversed binary search
algorithm achieves superior speedups by $15.8\times$ on average (up to
$26.8\times$) over prior SC engines in the Map step. The source code of Minuet
is publicly available at https://github.com/UofT-EcoSystem/Minuet.
Related papers
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt [65.25603275491544]
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS)
Our method is 30% faster than the original 3DGS while obtaining the same reconstruction quality optimization.
arXiv Detail & Related papers (2024-09-19T16:31:44Z) - Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For
Hidden Markov Models [70.26374282390401]
Decoding the original signal (i.e., hidden chain) from the noisy observations is one of the main goals in nearly all HMM based data analyses.
We present Quick Adaptive Ternary (QATS), a divide-and-conquer procedure which decodes the hidden sequence in polylogarithmic computational complexity.
arXiv Detail & Related papers (2023-05-29T19:37:48Z) - Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25$ times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z) - Discrete Morse Sandwich: Fast Computation of Persistence Diagrams for
Scalar Data -- An Algorithm and A Benchmark [8.648433479399857]
This paper introduces an efficient algorithm for persistence diagram computation, given an input piecewise linear scalar field f defined on a d-dimensional simplicial complex K.
We express this algorithm within the setting of discrete Morse theory, which considerably reduces the number of input simplices to consider.
We also introduce a stratification approach to the problem, that we call "sandwiching"
arXiv Detail & Related papers (2022-06-27T10:54:24Z) - Distributed Out-of-Memory NMF on CPU/GPU Architectures [1.0051474951635875]
We propose an efficient out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for HPC systems.
Benchmark results show significant improvement of 32X to 76x speedup with the new implementation using GPU over the CPU-based NMFk.
arXiv Detail & Related papers (2022-02-19T03:49:21Z) - GPU-accelerated Faster Mean Shift with euclidean distance metrics [1.3507758562554621]
Mean-shift algorithm is widely used to solve clustering problems.
In previous research, we proposed a novel GPU-accelerated Faster Mean-shift algorithm.
In this study, we extend and improve the previous algorithm to handle Euclidean distance metrics.
arXiv Detail & Related papers (2021-12-27T20:18:24Z) - ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception [91.24236600199542]
ASH is a modern and high-performance framework for parallel spatial hashing on GPU.
ASH achieves higher performance, supports richer functionality, and requires fewer lines of code.
ASH and its example applications are open sourced in Open3D.
arXiv Detail & Related papers (2021-10-01T16:25:40Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose textitVersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
textitVersaGNN achieves on average 3712$times$ speedup with 1301.25$times$ energy reduction on CPU, and 35.4$times$ speedup with 17.66$times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space
Search [7.596028906226877]
This paper presents a novel Mind-based search foraccel-based search space.
We derive differentiable approximations to the otherwise nonsmooth$optimal mapping space.
With a differentiable approximation, we can compare high-bound efficient algorithms to find Mind-based search schemes.
arXiv Detail & Related papers (2021-03-02T06:11:58Z) - Faster Mean-shift: GPU-accelerated clustering for cosine embedding-based
cell segmentation and tracking [12.60841328582138]
We propose a novel Faster Mean-shift algorithm, which tackles the computational bottleneck of embedding based cell segmentation and tracking.
The proposed Faster Mean-shift algorithm achieved 7-10 times speedup compared to the state-of-the-art embedding based cell instance segmentation and tracking algorithm.
Our Faster Mean-shift algorithm also achieved the highest computational speed compared to other GPU benchmarks with optimized memory consumption.
arXiv Detail & Related papers (2020-07-28T14:52:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.