Bi-directional Masks for Efficient N:M Sparse Training
- URL: http://arxiv.org/abs/2302.06058v1
- Date: Mon, 13 Feb 2023 02:32:02 GMT
- Title: Bi-directional Masks for Efficient N:M Sparse Training
- Authors: Yuxin Zhang, Yiting Luo, Mingbao Lin, Yunshan Zhong, Jingjing Xie, Fei
Chao, Rongrong Ji
- Abstract summary: We present a novel method of Bi-directional Masks (Bi-Mask) with two central innovations.
It disentangles the forward and backward weight sparsity and avoids the otherwise dense gradient computation.
Compared with the existing uni-directional scenario, which applies a transposable mask to enable backward acceleration, our Bi-Mask is experimentally demonstrated to be superior in performance.
- Score: 64.9617631724811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We focus on addressing the dense backward propagation issue that limits the
training efficiency of N:M fine-grained sparsity, which preserves at most N out of M
consecutive weights and achieves practical speedups on the N:M sparse
tensor core. To this end, we present a novel method of Bi-directional Masks
(Bi-Mask) with two central innovations: 1) separate sparse masks in the
two directions of forward and backward propagation to obtain training
acceleration; this disentangles the forward and backward weight sparsity and
avoids the otherwise dense gradient computation. 2) An efficient weight row
permutation method to maintain performance; it selects the permutation
candidate with the most eligible N:M weight blocks in the backward direction to minimize
the gradient gap between traditional uni-directional masks and our
bi-directional masks. Compared with the existing uni-directional scenario, which
applies a transposable mask to enable backward acceleration, our Bi-Mask is
experimentally demonstrated to be superior in performance. Also, our
Bi-Mask performs on par with or even better than methods that fail to achieve
backward acceleration. The project of this paper is available at
\url{https://github.com/zyxxmu/Bi-Mask}.
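The two innovations above can be made concrete with a small sketch. The following is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names (nm_mask, bidirectional_masks, count_eligible_blocks, best_row_permutation), the magnitude-based mask selection, and the exhaustive scoring of permutation candidates are hypothetical simplifications of the method described in the abstract and detailed in the linked repository.

```python
import numpy as np

def nm_mask(weight: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in each group of m consecutive
    weights along the last axis (standard magnitude-based N:M pruning)."""
    rows, cols = weight.shape
    assert cols % m == 0, "columns must be divisible by M"
    groups = np.abs(weight).reshape(rows, cols // m, m)
    top = np.argsort(groups, axis=-1)[..., -n:]      # indices of the n largest per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)

def bidirectional_masks(weight: np.ndarray, n: int = 2, m: int = 4):
    """Disentangled masks: the forward mask prunes along rows, while the
    backward mask is built on the transposed weight so the gradient GEMM
    can also use the N:M sparse tensor core."""
    forward_mask = nm_mask(weight, n, m)
    backward_mask = nm_mask(weight.T, n, m).T
    return forward_mask, backward_mask

def count_eligible_blocks(forward_mask: np.ndarray, n: int = 2, m: int = 4) -> int:
    """Count backward-direction groups of m positions that already keep at
    most n forward-surviving weights, i.e. blocks needing no extra pruning."""
    kept = forward_mask.T                            # backward (transposed) view
    rows, cols = kept.shape
    assert cols % m == 0, "rows of the original weight must be divisible by M"
    blocks = kept.reshape(rows, cols // m, m)
    return int((blocks.sum(axis=-1) <= n).sum())

def best_row_permutation(weight: np.ndarray, candidates, n: int = 2, m: int = 4):
    """Pick the candidate row permutation whose forward-masked weights expose
    the most eligible N:M blocks in the backward direction."""
    best_perm, best_score = None, -1
    for perm in candidates:
        permuted = weight[np.asarray(perm), :]
        fwd = nm_mask(permuted, n, m)
        score = count_eligible_blocks(fwd, n, m)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score
```

For example, with a 64x64 weight matrix and 2:4 sparsity, bidirectional_masks yields one mask per GEMM direction, and best_row_permutation ranks candidate row orders by how many backward groups of four weights already retain at most two forward-surviving weights.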
Related papers
- Efficiently Dispatching Flash Attention For Partially Filled Attention Masks [29.36452085947087]
Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices.
We introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware.
Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement.
arXiv Detail & Related papers (2024-09-23T15:11:07Z) - MP-Former: Mask-Piloted Transformer for Image Segmentation [16.620469868310288]
Mask2Former suffers from inconsistent mask predictions between decoder layers.
We propose a mask-piloted training approach, which feeds noised ground-truth masks into masked attention and trains the model to reconstruct the original ones.
arXiv Detail & Related papers (2023-03-13T17:57:59Z) - Non-Iterative Scribble-Supervised Learning with Pacing Pseudo-Masks for
Medical Image Segmentation [13.940364677162968]
Scribble-supervised medical image segmentation tackles the limitation of sparse masks.
We propose a non-iterative method, named PacingPseudo, where a stream of varying (pacing) pseudo-masks teaches a network via consistency training.
The efficacy of the proposed PacingPseudo is validated on three public medical image datasets.
arXiv Detail & Related papers (2022-10-20T01:57:44Z) - Optimizing Gradient-driven Criteria in Network Sparsity: Gradient is All
You Need [74.58939318994746]
Gradient-driven sparsity is used to reduce network complexity under the assumption of weight independence.
This assumption is contrary to the fact that weights mutually influence each other.
We propose to further optimize gradient-driven sparsity (OptG) by solving this independence paradox.
arXiv Detail & Related papers (2022-01-30T14:15:49Z) - Mask Transfiner for High-Quality Instance Segmentation [95.74244714914052]
We present Mask Transfiner for high-quality and efficient instance segmentation.
Our approach only processes detected error-prone tree nodes and self-corrects their errors in parallel.
Our code and trained models will be available at http://vis.xyz/pub/transfiner.
arXiv Detail & Related papers (2021-11-26T18:58:22Z) - Accelerated Sparse Neural Training: A Provable and Efficient Method to
Find N:M Transposable Masks [28.498176073737422]
Recently, researchers proposed pruning deep neural network (DNN) weights using an $N:M$ fine-grained block sparsity mask.
We propose a novel transposable fine-grained sparsity mask where the same mask can be used for both the forward and backward passes; a minimal check of this transposable property is sketched after this list.
Our experiments suggest a 2x speed-up with no accuracy degradation over vision and language models.
arXiv Detail & Related papers (2021-02-16T12:44:16Z) - KSM: Fast Multiple Task Adaption via Kernel-wise Soft Mask Learning [49.77278179376902]
Deep Neural Networks (DNNs) can forget the knowledge about earlier tasks when learning new tasks, which is known as catastrophic forgetting.
Recent continual learning methods are capable of alleviating the catastrophic forgetting problem on toy-sized datasets.
We propose a new training method called Kernel-wise Soft Mask (KSM), which learns a kernel-wise hybrid binary and real-value soft mask for each task.
arXiv Detail & Related papers (2020-09-11T21:48:39Z) - Ternary Feature Masks: zero-forgetting for task-incremental learning [68.34518408920661]
We propose an approach to continual learning for the task-aware regime that avoids any forgetting.
By using ternary masks we can upgrade a model to new tasks, reusing knowledge from previous tasks while not forgetting anything about them.
Our method outperforms current state-of-the-art while reducing memory overhead in comparison to weight-based approaches.
arXiv Detail & Related papers (2020-01-23T18:08:37Z) - BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation [103.74690082121079]
In this work, we achieve improved mask prediction by effectively combining instance-level information with semantic information of lower-level fine granularity.
Our main contribution is a blender module which draws inspiration from both top-down and bottom-up instance segmentation approaches.
BlendMask can effectively predict dense per-pixel position-sensitive instance features with very few channels, and learn attention maps for each instance with merely one convolution layer.
arXiv Detail & Related papers (2020-01-02T03:30:17Z)
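As referenced in the transposable-mask entry above, the following is a minimal sketch, assuming a NumPy-style binary mask, of what it means for one mask to serve both passes: the mask must satisfy the N:M constraint along its rows and along its columns. The helper name is_transposable_nm is hypothetical and not taken from the cited paper.

```python
import numpy as np

def is_transposable_nm(mask: np.ndarray, n: int = 2, m: int = 4) -> bool:
    """Return True if every group of m consecutive entries keeps at most n
    ones, both along the rows of the mask and along the rows of its
    transpose (i.e. its columns), so one mask can serve forward and backward."""
    def satisfies_nm(mat: np.ndarray) -> bool:
        rows, cols = mat.shape
        if cols % m != 0:
            return False
        groups = mat.astype(bool).reshape(rows, cols // m, m)
        return bool((groups.sum(axis=-1) <= n).all())
    return satisfies_nm(mask) and satisfies_nm(mask.T)
```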
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.