Related papers: Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs

Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs

URL: http://arxiv.org/abs/2109.07710v1
Date: Thu, 16 Sep 2021 04:12:51 GMT
Title: Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs
Authors: Anup Sarma, Sonali Singh, Huaipan Jiang, Ashutosh Pattnaik, Asit K Mishra, Vijaykrishnan Narayanan, Mahmut T Kandemir and Chita R Das
Abstract summary: Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies. However, training these models involving large parameters is both time-consuming and energy-hogging.
Score: 15.465530153038927
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies, achieving high accuracy on computer vision workloads such as image classification and object detection. However, training these models involving large parameters is both time-consuming and energy-hogging. In this regard, several prior works have advocated for sparsity to speed up the of DL training and more so, the inference phase. This work begins with the observation that during training, sparsity in the forward and backward passes are correlated. In that context, we investigate two types of sparsity (input and output type) inherent in gradient descent-based optimization algorithms and propose a hardware micro-architecture to leverage the same. Our experimental results use five state-of-the-art CNN models on the Imagenet dataset, and show back propagation speedups in the range of 1.69$\times$ to 5.43$\times$, compared to the dense baseline execution. By exploiting sparsity in both the forward and backward passes, speedup improvements range from 1.68$\times$ to 3.30$\times$ over the sparsity-agnostic baseline execution. Our work also achieves significant reduction in training iteration time over several previously proposed dense as well as sparse accelerator based platforms, in addition to achieving order of magnitude energy efficiency improvements over GPU based execution.

Related papers

FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images. It reconstructs Histograms of Oriented Gradients (HOG) feature instead of original RGB values of the input images. It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [158.19276683455254]
Adaptive gradient algorithms borrow the moving average idea of heavy ball acceleration to estimate accurate first second-order moments of gradient for accelerating convergence. Nesterov acceleration converges faster than ball acceleration in theory and also in many empirical cases. In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing gradient at the point. We show that Adan surpasses the corresponding SoTAs on both vision transformers (ViTs and CNNs) and sets new SoTAs for many popular networks.
arXiv Detail & Related papers (2022-08-13T16:04:39Z)
FastHebb: Scaling Hebbian Training of Deep Neural Networks to ImageNet Level [7.410940271545853]
We present FastHebb, an efficient and scalable solution for Hebbian learning. FastHebb outperforms previous solutions by up to 50 times in terms of training speed. For the first time, we are able to bring Hebbian algorithms to ImageNet scale.
arXiv Detail & Related papers (2022-07-07T09:04:55Z)
Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples. We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment. We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
Speedup deep learning models on GPU by taking advantage of efficient unstructured pruning and bit-width reduction [0.0]
This work is focused on the pruning of some convolutional neural networks (CNNs) and improving theirs efficiency on graphic processing units ( GPU) The Nvidia deep neural network (cuDnn) library is the most effective implementations of deep learning (DL) algorithms for GPUs.
arXiv Detail & Related papers (2021-12-28T19:36:41Z)
Boosting Fast Adversarial Training with Learnable Adversarial Initialization [79.90495058040537]
Adrial training (AT) has been demonstrated to be effective in improving model robustness by leveraging adversarial examples for training. To boost training efficiency, fast gradient sign method (FGSM) is adopted in fast AT methods by calculating gradient only once.
arXiv Detail & Related papers (2021-10-11T05:37:00Z)
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline. While being faster, our pre-trained models are equipped with strong knowledge transferability, achieving comparable and sometimes higher GLUE score than the baseline.
arXiv Detail & Related papers (2020-10-26T06:50:07Z)
Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training [0.5219568203653523]
We develop a sparse DNN training accelerator that produces pruned models with the same accuracy as dense models without first training, then pruning, and finally retraining, a dense model. Compared to training the equivalent unpruned models using a state-of-the-art DNN accelerator without sparse training support, Procrustes consumes up to 3.26$times$ less energy and offers up to 4$times$ speedup across a range of models, while pruning weights by an order of magnitude and maintaining unpruned accuracy.
arXiv Detail & Related papers (2020-09-23T07:39:55Z)
Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications. In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training. Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.