Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs
- URL: http://arxiv.org/abs/2407.20496v1
- Date: Tue, 30 Jul 2024 01:40:50 GMT
- Title: Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs
- Authors: Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin
- Abstract summary: N:M sparsity pruning is a powerful technique for compressing deep neural networks.
We introduce a channel permutation method designed specifically for HiNM sparsity, named gyro-permutation.
- Score: 1.3124513975412255
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: N:M sparsity pruning is a powerful technique for compressing deep neural networks, utilizing NVIDIA's Sparse Tensor Core technology. This method benefits from hardware support for sparse indexing, enabling the adoption of fine-grained sparsity to maintain model accuracy while minimizing the overhead typically associated with irregular data access. Although restricted to a fixed level of sparsity due to its reliance on hardware, N:M sparsity can be combined with coarser sparsity techniques to achieve diverse compression ratios. Initially, column-wise vector sparsity is applied to a dense model, followed by row-wise N:M sparsity on the preserved column vectors. We call this multi-level approach hierarchical N:M (HiNM) sparsity. Similar to earlier single-level sparsity techniques, HiNM sparsity necessitates an effective channel permutation strategy to maximize the accuracy of the compressed networks. However, it introduces further complexities by requiring the rearrangement of both input and output channels, raising challenges such as permutation sequence, HiNM-sparsity-aware permutation, and maintaining consistency in channel ordering across layers. In this paper, we introduce a channel permutation method designed specifically for HiNM sparsity, named gyro-permutation. This method is crafted to exploit the unique characteristics of HiNM pruning, incorporating a strategic policy in each permutation phase, including channel sampling, clustering, and assignment, to circumvent local minima. Additionally, we have developed a GPU kernel that facilitates independent layer permutation during the execution of HiNM sparse networks. Our extensive experimental evaluations on various DNN models demonstrate that gyro-permutation significantly enhances the accuracy of HiNM sparse networks, allowing them to reach performance levels comparable to those of unstructured sparse networks.
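The two-level pruning described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `hinm_mask` and `retained_magnitude`, the block height `v`, the per-block budget `keep_cols`, and the use of L2 norm and absolute magnitude as saliency scores are all assumptions made for illustration. The sketch first keeps the strongest column vectors inside each block of `v` rows, then applies N:M pruning along each row over the kept columns.

```python
import numpy as np


def hinm_mask(w, v=4, keep_cols=8, n=2, m=4):
    """Boolean keep-mask for a hypothetical two-level (vector + N:M) pruning.

    Assumptions (not from the paper): rows are split into blocks of height v,
    column vectors are scored by L2 norm, weights are scored by magnitude.
    """
    rows, cols = w.shape
    if rows % v or keep_cols % m or keep_cols > cols:
        raise ValueError("shape is not compatible with v, keep_cols, m")
    mask = np.zeros(w.shape, dtype=bool)
    for r0 in range(0, rows, v):
        block = w[r0:r0 + v]                       # (v, cols) slice of rows
        # Level 1: keep the keep_cols column vectors with the largest L2 norm.
        norms = np.linalg.norm(block, axis=0)
        kept = np.sort(np.argsort(-norms, kind="stable")[:keep_cols])
        # Level 2: within each row, keep the n largest of every m kept weights.
        groups = np.abs(block[:, kept]).reshape(v, keep_cols // m, m)
        top = np.argsort(-groups, axis=2, kind="stable")[:, :, :n]
        gmask = np.zeros(groups.shape, dtype=bool)
        np.put_along_axis(gmask, top, True, axis=2)
        mask[r0:r0 + v, kept] = gmask.reshape(v, keep_cols)
    return mask


def retained_magnitude(w, **kw):
    """Sum of |w| over the weights that survive pruning."""
    return float(np.abs(w[hinm_mask(w, **kw)]).sum())


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
mask = hinm_mask(w)
print("kept per row:", mask.sum(axis=1))
print("density:", mask.mean())

# Reordering output channels (rows) changes which rows share a block,
# so the same weights can yield a different retained magnitude.
perm = rng.permutation(w.shape[0])
print("retained (original order):", retained_magnitude(w))
print("retained (rows permuted):", retained_magnitude(w[perm]))
```

With the default arguments every row keeps (8 / 4) x 2 = 4 of its 16 weights, for an overall density of 25%. The last two lines hint at why permutation matters: the retained magnitude depends on channel order, which is the quantity a permutation search would try to improve. The search itself (sampling, clustering, assignment) is not shown here.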
Related papers
- SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models [19.479746878680707]
Layer pruning is a potent approach to reduce network size and improve computational efficiency.
We propose a Similarity Guided fast Layer Partition pruning for compressing large deep models.
Our method outperforms the state-of-the-art methods in both accuracy and computational efficiency.
arXiv Detail & Related papers (2024-10-14T04:01:08Z)
- Scalable Graph Compressed Convolutions [68.85227170390864]
We propose a differentiable method that applies permutations to calibrate input graphs for Euclidean convolution.
Based on the graph calibration, we propose the Compressed Convolution Network (CoCN) for hierarchical graph representation learning.
arXiv Detail & Related papers (2024-07-26T03:14:13Z)
- NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions [2.7086888205833968]
Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks.
We propose relaxing the boundaries of neurons and mapping entire sub-networks to a single LUT.
We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST.
arXiv Detail & Related papers (2024-02-29T16:10:21Z)
- Multicoated and Folded Graph Neural Networks with Strong Lottery Tickets [3.0894823679470087]
This paper introduces the Multi-Stage Folding and Unshared Masks methods to expand the search space in terms of both architecture and parameters.
By achieving high sparsity, competitive performance, and high memory efficiency with up to 98.7% reduction, it demonstrates suitability for energy-efficient graph processing.
arXiv Detail & Related papers (2023-12-06T02:16:44Z)
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with attention mechanism, we can effectively boost performance without huge computational overhead.
We show our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
- T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z)
- Spatial Re-parameterization for N:M Sparsity [92.72334929464013]
N:M sparsity exhibits a fixed sparsity rate within the spatial domains.
Unstructured sparsity displays a substantial divergence in sparsity across the spatial domains.
SpRe has achieved a commendable feat by matching the performance of N:M sparsity methods with state-of-the-art unstructured sparsity methods.
arXiv Detail & Related papers (2023-06-09T01:11:50Z)
- Learning k-Level Structured Sparse Neural Networks Using Group Envelope Regularization [4.0554893636822]
We introduce a novel approach to deploy large-scale Deep Neural Networks on constrained resources.
The method speeds up inference time and aims to reduce memory demand and power consumption.
arXiv Detail & Related papers (2022-12-25T15:40:05Z)
- VQ-GNN: A Universal Framework to Scale up Graph Neural Networks using Vector Quantization [70.8567058758375]
VQ-GNN is a universal framework to scale up any convolution-based GNNs using Vector Quantization (VQ) without compromising the performance.
Our framework avoids the "neighbor explosion" problem of GNNs using quantized representations combined with a low-rank version of the graph convolution matrix.
arXiv Detail & Related papers (2021-10-27T11:48:50Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
- Learning Sparse Filters in Deep Convolutional Neural Networks with a l1/l2 Pseudo-Norm [5.3791844634527495]
Deep neural networks (DNNs) have proven to be efficient for numerous tasks, but come at a high memory and computation cost.
Recent research has shown that their structure can be more compact without compromising their performance.
We present a sparsity-inducing regularization term based on the ratio l1/l2 pseudo-norm defined on the filter coefficients.
arXiv Detail & Related papers (2020-07-20T11:56:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.