Minimum Variance Unbiased N:M Sparsity for the Neural Gradients
- URL: http://arxiv.org/abs/2203.10991v3
- Date: Sun, 9 Jun 2024 16:21:18 GMT
- Title: Minimum Variance Unbiased N:M Sparsity for the Neural Gradients
- Authors: Brian Chmiel, Itay Hubara, Ron Banner, Daniel Soudry
- Abstract summary: In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix Multiply (GEMM) by up to 2x.
We examine how this method can also be used for the neural gradients.
- Score: 29.555643722721882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix Multiply (GEMM) by up to 2x, and doubles throughput by skipping the computation of zero values. So far, it has mainly been used to prune weights in order to accelerate the forward and backward phases. We examine how this method can also be used for the neural gradients (i.e., loss gradients with respect to the intermediate neural layer outputs). To this end, we first establish a tensor-level optimality criterion. Previous works aimed to minimize the mean-square error (MSE) of each pruned block. We show that while minimizing the MSE works fine for pruning the weights and activations, it catastrophically fails for the neural gradients. Instead, we show that accurate pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks, and find that in most cases 1:2 sparsity is sufficient for training, and 2:4 sparsity is usually enough when it is not. Further, we suggest combining several such methods in order to potentially speed up training even more.
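To make the abstract's criterion concrete, below is a minimal NumPy sketch of unbiased 1:2 pruning in which each block of two entries keeps exactly one value, chosen with probability proportional to its magnitude, and rescales it so the block is preserved in expectation. The function name, blocking layout, and zero-handling are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def mvue_1of2_prune(g, rng=None):
    """Unbiased 1:2 pruning of a flat gradient tensor (illustrative sketch).

    Each consecutive pair (a, b) keeps exactly one entry, chosen with
    probability proportional to its magnitude, and rescales the survivor
    by 1/p so that the pruned block equals (a, b) in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(g, dtype=np.float64)
    assert g.size % 2 == 0, "pad to an even length before blocking"
    blocks = g.reshape(-1, 2)
    mag = np.abs(blocks)
    total = mag.sum(axis=1, keepdims=True)
    # Probability of keeping the first element of each pair (all-zero blocks: 0.5).
    p_first = np.where(total > 0, mag[:, :1] / np.maximum(total, 1e-30), 0.5)
    keep_first = rng.random((blocks.shape[0], 1)) < p_first
    p_keep = np.where(keep_first, p_first, 1.0 - p_first)
    mask = np.concatenate([keep_first, ~keep_first], axis=1)
    pruned = np.where(mask, blocks / np.maximum(p_keep, 1e-30), 0.0)
    return pruned.reshape(g.shape)

# Unbiasedness check: the average of many pruned samples approaches the input.
g = np.array([0.3, -1.2, 0.05, 0.7])
print(np.mean([mvue_1of2_prune(g) for _ in range(20000)], axis=0))
```

Averaging many pruned samples recovers the original values, which is the unbiasedness property the abstract argues is essential for the neural gradients, in contrast to greedy MSE-minimizing (magnitude) pruning, which is biased.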
Related papers
- S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training [20.113352600259226]
We propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor.
Results show that our method surpass previous 2:4 pre-training recipes and is comparable even with full parameter models.
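As a rough illustration of the two ingredients described above, the hedged PyTorch sketch below projects a weight tensor to 2:4 sparsity by keeping the two largest-magnitude entries in each group of four and then applies a single per-tensor scaling factor; the least-squares choice of that factor is my assumption, not necessarily the one used by S-STE.

```python
import torch

def project_2to4(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every group of 4 (2:4 sparsity)."""
    flat = w.reshape(-1, 4)
    idx = flat.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(flat, dtype=torch.bool).scatter_(1, idx, True)
    return (flat * mask).reshape(w.shape)

def rescale_per_tensor(w_dense: torch.Tensor, w_sparse: torch.Tensor) -> torch.Tensor:
    """One plausible per-tensor scaling factor: the least-squares fit of the
    sparse tensor to the dense one (an assumption, not necessarily S-STE's choice)."""
    beta = (w_dense * w_sparse).sum() / w_sparse.pow(2).sum().clamp_min(1e-12)
    return beta * w_sparse

w = torch.randn(8, 16)                        # last dim divisible by 4
w_24 = rescale_per_tensor(w, project_2to4(w))
```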
arXiv Detail & Related papers (2024-09-13T08:29:36Z)
- When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work [59.29606307518154]
We show that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss.
We also consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer.
arXiv Detail & Related papers (2022-10-21T14:41:26Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
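The variance gap claimed above can be illustrated with a toy example: for a single linear layer and a squared-error loss, the sketch below compares a forward-gradient estimate built from a random weight perturbation against one built from a random activation perturbation followed by the exact local chain rule. The setup (NumPy, explicit gradients instead of forward-mode AD) is an illustrative simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)
t = rng.standard_normal(d_out)

resid = W @ x - t                  # dL/dy for L = 0.5 * ||W x - t||^2
g_true = np.outer(resid, x)        # exact weight gradient (used here only to form
                                   # directional derivatives; forward-mode AD would
                                   # compute those without materializing it)

def fg_weight():                   # forward gradient with a weight perturbation
    v = rng.standard_normal(W.shape)
    return np.sum(g_true * v) * v

def fg_activation():               # perturb the activations, then exact chain rule
    u = rng.standard_normal(d_out)
    return np.outer((resid @ u) * u, x)

n = 2000
err_w = np.mean([np.sum((fg_weight() - g_true) ** 2) for _ in range(n)])
err_a = np.mean([np.sum((fg_activation() - g_true) ** 2) for _ in range(n)])
print(err_w, err_a)                # activation perturbation has far lower variance
```

Both estimators are unbiased, but the weight perturbation injects noise across all d_out * d_in parameters, while the activation perturbation only randomizes the d_out activations, which is the variance reduction the summary refers to.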
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction [12.86325214182021]
Dynamic model pruning is a recent direction that allows for the inference of a different sub-network for each input sample during deployment.
Current dynamic methods rely on learning continuous channel gating through regularization with a sparsity-inducing loss.
We show experiments on several neural architectures, such as VGG, ResNet, and MobileNet on CIFAR and ImageNet.
arXiv Detail & Related papers (2021-10-15T17:39:53Z)
- Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to apply differential privacy to the BERT model, achieving an average accuracy of 83.9% on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z)
- Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
- Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent [25.589302381660453]
We show that any classifier trained using gradient descent with respect to square-loss will fail to achieve small test error in polynomial time.
For classification, we give a stronger result, namely that any statistical query (SQ) algorithm will fail to achieve small test error in polynomial time.
arXiv Detail & Related papers (2020-06-22T05:15:06Z)
- Neural gradients are near-lognormal: improved quantized and sparse training [35.28451407313548]
We find that the distribution of neural gradients is approximately lognormal.
We suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients.
To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity -- in each case without accuracy degradation.
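One way to realize unbiased gradient sparsity of this kind is stochastic pruning: entries below a threshold are either zeroed or rounded up to the threshold with a probability that keeps the expectation unchanged. The NumPy sketch below is an assumption-laden illustration; how the threshold is chosen from the (near-lognormal) gradient statistics is not shown.

```python
import numpy as np

def stochastic_prune(g, alpha, rng=None):
    """Unbiased threshold pruning (sketch): entries with |g| < alpha are kept
    with probability |g| / alpha and rounded to sign(g) * alpha, otherwise
    zeroed, so that E[output] = g; entries at or above alpha pass through."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(g, dtype=np.float64)
    small = np.abs(g) < alpha
    keep = rng.random(g.shape) < np.abs(g) / alpha
    return np.where(small, np.where(keep, np.sign(g) * alpha, 0.0), g)
```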
arXiv Detail & Related papers (2020-06-15T07:00:15Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
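A hedged sketch of the general idea: keep each linear layer in factored form W = U diag(s) V^T during training, penalize deviation of U and V from orthogonality, and apply a sparsity-inducing penalty to the singular values s. The specific regularizers and hyperparameters below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class SVDLinear(nn.Module):
    """Linear layer kept in factored form W = U diag(s) V^T during training."""
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, rank) * 0.1)
        self.s = nn.Parameter(torch.ones(rank))
        self.V = nn.Parameter(torch.randn(in_features, rank) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x V diag(s) U^T, equivalent to a dense layer with W = U diag(s) V^T.
        return ((x @ self.V) * self.s) @ self.U.t()

    def regularizer(self, ortho_weight=1.0, sparse_weight=1e-3) -> torch.Tensor:
        eye = torch.eye(self.s.numel(), device=self.s.device)
        ortho = ((self.U.t() @ self.U - eye) ** 2).sum() \
              + ((self.V.t() @ self.V - eye) ** 2).sum()
        return ortho_weight * ortho + sparse_weight * self.s.abs().sum()

layer = SVDLinear(128, 64, rank=32)
loss = layer(torch.randn(8, 128)).pow(2).mean() + layer.regularizer()
```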
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
- Improving the Backpropagation Algorithm with Consequentialism Weight Updates over Mini-Batches [0.40611352512781856]
We show that it is possible to consider a multi-layer neural network as a stack of adaptive filters.
We introduce a better algorithm by predicting and then amending the adverse consequences of the actions that take place in BP, even before they happen.
Our experiments show the usefulness of our algorithm in the training of deep neural networks.
arXiv Detail & Related papers (2020-03-11T08:45:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.