PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation
Invariant Transformation
- URL: http://arxiv.org/abs/2301.10936v2
- Date: Sun, 8 Oct 2023 01:33:15 GMT
- Title: PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation
Invariant Transformation
- Authors: Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Yuqing Yang,
Lingxiao Ma, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, Lidong Zhou
- Abstract summary: We propose Permutation Invariant Transformation (PIT) for dynamic sparsity computation.
PIT transforms micro-tiles into a GPU-efficient dense tile without changing the results.
It can accelerate dynamic sparsity computation by up to 5.9x (average 2.43x) over state-of-the-art compilers.
- Score: 15.860204740425791
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic sparsity, where the sparsity patterns are unknown until runtime,
poses a significant challenge to deep learning. The state-of-the-art
sparsity-aware deep learning solutions are restricted to pre-defined, static
sparsity patterns due to significant overheads associated with preprocessing.
Efficient execution of dynamic sparse computation often faces a misalignment
between the GPU-friendly tile configuration needed for efficient execution and the
sparsity-aware tile shape that minimizes coverage waste (tiles covering zero
values in the tensor).
In this paper, we propose PIT, a deep-learning compiler for dynamic sparsity.
PIT proposes a novel tiling mechanism that leverages Permutation Invariant
Transformation (PIT), a mathematically proven property, to transform multiple
sparsely located micro-tiles into a GPU-efficient dense tile without changing
the computation results, thus achieving both high GPU utilization and low
coverage waste. Given a model, PIT first finds feasible PIT rules for all its
operators and generates efficient GPU kernels accordingly. At runtime, with the
novel SRead and SWrite primitives, PIT rules can be executed extremely fast to
support dynamic sparsity in an online manner. Extensive evaluation on diverse
models shows that PIT can accelerate dynamic sparsity computation by up to 5.9x
(average 2.43x) over state-of-the-art compilers.
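To make the tiling idea concrete, here is a minimal NumPy sketch of the gather-compute-scatter pattern the abstract describes, written for a dynamically row-sparse input. The function name, block size, and row-block sparsity pattern are illustrative assumptions, not PIT's actual GPU kernels or API.

```python
import numpy as np

def pit_style_sparse_matmul(x, w, block=4):
    """Toy sketch of PIT-style tiling for a dynamically row-sparse input:
    gather the non-zero row blocks (micro-tiles) into one dense tile, run a
    single dense matmul, then scatter each block's result back in place."""
    n_blocks = x.shape[0] // block
    # The sparsity pattern is only known at runtime: find non-empty row blocks.
    active = [b for b in range(n_blocks) if np.any(x[b * block:(b + 1) * block])]
    out = np.zeros((x.shape[0], w.shape[1]), dtype=x.dtype)
    if not active:
        return out
    # SRead-like gather: pack the active micro-tiles into one dense tile.
    dense_tile = np.concatenate([x[b * block:(b + 1) * block] for b in active])
    # Permutation invariance: reordering rows does not change each row's
    # product with w, so one dense GEMM over the packed tile is exact.
    packed = dense_tile @ w
    # SWrite-like scatter: write every micro-tile's result back to its origin.
    for i, b in enumerate(active):
        out[b * block:(b + 1) * block] = packed[i * block:(i + 1) * block]
    return out

rng = np.random.default_rng(0)
x = np.zeros((16, 8))
x[0:4], x[8:12] = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
w = rng.normal(size=(8, 6))
assert np.allclose(pit_style_sparse_matmul(x, w), x @ w)  # identical results, denser compute
```

In PIT itself, the SRead and SWrite primitives play the role of this gather and scatter and are fast enough to handle sparsity patterns that only appear at runtime, so the GEMM always runs on dense, GPU-friendly tiles.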
Related papers
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
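As a rough illustration of the structured-matrix idea summarized in the entry above, the sketch below uses a plain block-diagonal layer, which is far simpler than Monarch or Block Tensor-Train matrices but shows the same trade: fewer parameters and FLOPs than a dense layer of the same width. All names here are invented for the example.

```python
import numpy as np

def block_diagonal_matmul(x, blocks):
    """Multiply x by a block-diagonal matrix stored only as its diagonal blocks.
    With k blocks, this needs roughly 1/k of the parameters and FLOPs of the
    equivalent dense layer."""
    k = len(blocks)
    chunk = x.shape[1] // k
    return np.concatenate(
        [x[:, i * chunk:(i + 1) * chunk] @ blocks[i] for i in range(k)], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
blocks = [rng.normal(size=(4, 4)) for _ in range(2)]         # two 4x4 diagonal blocks
dense_equiv = np.block([[blocks[0], np.zeros((4, 4))],
                        [np.zeros((4, 4)), blocks[1]]])       # same operator in dense form
assert np.allclose(block_diagonal_matmul(x, blocks), x @ dense_equiv)
```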
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
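A toy sketch of one of the dynamic paradigms listed in the entry above, dynamic layer skipping, is shown below; the gate function and threshold are invented for illustration and are unrelated to LAUDNet's actual scheduler or latency model.

```python
import numpy as np

def gated_layer(x, w, threshold=0.5):
    """Dynamic layer skipping in miniature: a cheap per-sample gate decides
    whether the expensive layer runs; skipped samples pass through unchanged."""
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=1)))   # toy gate score per sample
    run = gate > threshold                         # decision is only known at runtime
    out = x.copy()
    if run.any():
        out[run] = np.maximum(x[run] @ w, 0.0)     # ReLU(x @ w) only where gated on
    return out

rng = np.random.default_rng(1)
x, w = rng.normal(size=(8, 16)), rng.normal(size=(16, 16))
y = gated_layer(x, w)  # only a data-dependent subset of rows pays for the matmul
```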
- Dynamic Sparse Training with Structured Sparsity [11.778353786208765]
Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training.
We propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity.
We demonstrate a real-world acceleration of 3.4x/2.5x on CPU for online inference and 1.7x/13.0x on GPU for inference with a batch size of 256.
arXiv Detail & Related papers (2023-05-03T17:48:55Z)
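For readers unfamiliar with the N:M pattern SRigL targets, the sketch below enforces a 2:4 constraint (at most 2 non-zeros in every group of 4 consecutive weights) by magnitude; this only illustrates the constraint itself, not SRigL's training procedure.

```python
import numpy as np

def prune_n_m(w, n=2, m=4):
    """Enforce N:M structured sparsity: in every group of m consecutive weights
    along the last axis, keep only the n largest-magnitude entries."""
    groups = w.reshape(-1, m)
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]   # smallest m-n per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

rng = np.random.default_rng(3)
w = rng.normal(size=(4, 16))
w_24 = prune_n_m(w)  # 2:4 sparse, i.e. 50% of the weights are zero
assert np.all((w_24.reshape(-1, 4) != 0).sum(axis=1) <= 2)
```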
- Dynamically Reconfigurable Variable-precision Sparse-Dense Matrix Acceleration in Tensorflow Lite [0.0]
We present a dynamically reconfigurable hardware accelerator called FADES (Fused Architecture for DEnse and Sparse matrices).
The FADES design offers multiple configuration options that trade off complexity and parallelism using a dataflow model to create four stages that read, compute, scale and write results.
We show that the core can outperform dense mode even at low sparsity levels, and a single core achieves up to 20x acceleration over the software-optimized NEON RUY library.
arXiv Detail & Related papers (2023-04-17T12:31:50Z)
- PopSparse: Accelerated block sparse matrix multiplication on IPU [0.5661403709207713]
We introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs.
We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run.
Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels.
arXiv Detail & Related papers (2023-03-29T20:00:19Z)
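The sketch below is a plain-Python picture of what a block-sparse matmul routine of this kind computes: only explicitly stored non-zero blocks do any work. The storage layout and function name are assumptions for the example, not PopSparse's API.

```python
import numpy as np

def block_sparse_matmul(blocks, index, n_rows, x, bs=4):
    """y = A @ x, where A is stored only as its non-zero (bs x bs) blocks plus
    their (block_row, block_col) coordinates; absent blocks cost nothing."""
    y = np.zeros((n_rows, x.shape[1]))
    for blk, (br, bc) in zip(blocks, index):
        y[br * bs:(br + 1) * bs] += blk @ x[bc * bs:(bc + 1) * bs]
    return y

rng = np.random.default_rng(4)
bs, index = 4, [(0, 0), (1, 2), (3, 1)]              # 3 non-zero blocks of a 16x16 matrix
blocks = [rng.normal(size=(bs, bs)) for _ in index]
dense = np.zeros((16, 16))                           # dense reference for checking
for blk, (br, bc) in zip(blocks, index):
    dense[br * bs:(br + 1) * bs, bc * bs:(bc + 1) * bs] = blk
x = rng.normal(size=(16, 3))
assert np.allclose(block_sparse_matmul(blocks, index, 16, x, bs), dense @ x)
```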
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Neural Stochastic Dual Dynamic Programming [99.80617899593526]
We introduce a trainable neural model that learns to map problem instances to a piece-wise linear value function.
ν-SDDP can significantly reduce problem-solving cost without sacrificing solution quality.
arXiv Detail & Related papers (2021-12-01T22:55:23Z)
- Dual-side Sparse Tensor Core [18.204976918925635]
Existing GPUs can only leverage the sparsity from weights but not activations, which are dynamic, unpredictable, and hence challenging to exploit.
We propose a novel architecture to efficiently harness the dual-side sparsity (i.e., weight and activation sparsity).
Our design can fully unleash the dual-side sparsity and improve the performance by up to one order of magnitude with small hardware overhead.
arXiv Detail & Related papers (2021-05-20T07:36:16Z)
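A minimal sketch of the dual-side idea from the entry above: skip work whenever either the activation entry or the weight entry is zero. The loop-based routine is purely illustrative; the paper's contribution is a hardware tensor-core design, not this software code.

```python
import numpy as np

def dual_side_sparse_matvec(x, w):
    """Computes x @ w, but multiply-accumulates only where both the activation
    entry (dynamic sparsity) and the weight row (static sparsity) are non-zero."""
    out = np.zeros(w.shape[1])
    for i in np.nonzero(x)[0]:           # skip zero activations
        cols = np.nonzero(w[i])[0]       # skip zero weights in this row
        out[cols] += x[i] * w[i, cols]
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=32) * (rng.random(32) < 0.25)              # ~75% sparse activations
w = rng.normal(size=(32, 16)) * (rng.random((32, 16)) < 0.5)   # ~50% sparse weights
assert np.allclose(dual_side_sparse_matvec(x, w), x @ w)
```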
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
- Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity [12.643043455369297]
We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures.
We implement and evaluate the sparsity pattern on GPU tensor cores, achieving a 1.95x speedup over the dense model.
arXiv Detail & Related papers (2020-08-29T16:27:41Z)
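To make the tile-granular idea from the entry above concrete, here is a hedged sketch that prunes whole weight tiles by norm; the tile size and the keep-by-norm rule are assumptions for illustration and do not reproduce the paper's exact tile-wise pattern.

```python
import numpy as np

def prune_tiles(w, tile=4, keep_ratio=0.5):
    """Zero out the (tile x tile) blocks of w with the smallest L2 norm, keeping
    a coarse structure that dense tile-based GPU kernels can simply skip."""
    rows, cols = w.shape[0] // tile, w.shape[1] // tile
    norms = np.array([[np.linalg.norm(w[r*tile:(r+1)*tile, c*tile:(c+1)*tile])
                       for c in range(cols)] for r in range(rows)])
    keep = norms >= np.quantile(norms, 1.0 - keep_ratio)   # keep the largest tiles
    out = w.copy()
    for r in range(rows):
        for c in range(cols):
            if not keep[r, c]:
                out[r*tile:(r+1)*tile, c*tile:(c+1)*tile] = 0.0
    return out

rng = np.random.default_rng(5)
w = rng.normal(size=(16, 16))
w_pruned = prune_tiles(w)   # roughly half of the 4x4 tiles are zeroed out
```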