TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network
Training and Inference
- URL: http://arxiv.org/abs/2009.00748v1
- Date: Tue, 1 Sep 2020 23:39:35 GMT
- Title: TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network
Training and Inference
- Authors: Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady
Pekhimenko, Jorge Albericio, and Andreas Moshovos
- Abstract summary: TensorDash is a hardware-level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams.
When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency.
- Score: 3.238873941995477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: TensorDash is a hardware-level technique for enabling data-parallel MAC units
to take advantage of sparsity in their input operand streams. When used to
compose a hardware accelerator for deep learning, TensorDash can speed up the
training process while also increasing energy efficiency. TensorDash combines a
low-cost, sparse input operand interconnect comprising an 8-input multiplexer
per multiplier input, with an area-efficient hardware scheduler. While the
interconnect allows a very limited set of movements per operand, the scheduler
can effectively extract sparsity when it is present in the activations, weights
or gradients of neural networks. Over a wide set of models covering various
applications, TensorDash accelerates the training process by $1.95\times$ while
being $1.89\times$ more energy-efficient ($1.6\times$ when on-chip and off-chip
memory accesses are taken into account). While
TensorDash works with any datatype, we demonstrate it with both
single-precision floating-point units and bfloat16.
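
To make the mechanism concrete, below is a minimal software sketch of the kind of sparsity-aware operand packing TensorDash performs in hardware. It is an illustration, not the paper's scheduler: the abstract only states that each multiplier input has an 8-input multiplexer and that operand movement is very limited, so the lane count, the lookahead/lookaside window, and the greedy fill policy are assumptions, and the sketch tracks only slot occupancy (it ignores the pairing of the second operand stream and the routing of partial sums).

    import numpy as np

    LANES = 16       # multipliers working in parallel (illustrative)
    LOOKAHEAD = 2    # future steps a lane may pull from (assumption)
    LOOKASIDE = 1    # neighbouring lanes it may pull from (assumption)

    def pack_sparse_stream(stream):
        """Greedily fill zero MAC slots from a limited movement window.

        stream: (steps, LANES) array of operands; zeros are ineffectual work.
        Returns the packed schedule and the number of steps actually issued.
        """
        work = stream.astype(float)   # astype copies, so the input is untouched
        schedule = []
        for t in range(work.shape[0]):
            row = work[t].copy()
            for lane in range(LANES):
                if row[lane] != 0:
                    continue
                # Candidate positions this lane's multiplexer is assumed to reach.
                for dt in range(1, LOOKAHEAD + 1):
                    for dl in range(-LOOKASIDE, LOOKASIDE + 1):
                        src_t, src_l = t + dt, lane + dl
                        if (0 <= src_l < LANES and src_t < work.shape[0]
                                and work[src_t, src_l] != 0):
                            row[lane] = work[src_t, src_l]  # promote the operand
                            work[src_t, src_l] = 0          # and free its old slot
                            break
                    else:
                        continue
                    break
            if np.any(row != 0):        # fully emptied steps are skipped entirely
                schedule.append(row)
        return np.array(schedule), len(schedule)

    rng = np.random.default_rng(0)
    dense = rng.random((64, LANES)) * (rng.random((64, LANES)) > 0.6)  # ~60% zeros
    packed, steps = pack_sparse_stream(dense)
    print(f"dense steps: {dense.shape[0]}, packed steps: {steps}")

The ratio of dense to packed steps is a rough proxy for the speedup such a scheduler can extract when sparsity is present in activations, weights, or gradients.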
Related papers
- Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators [0.0]
Deep Neural Networks (DNNs) are being developed, trained, and utilized, putting a strain on both advanced and limited devices.
Our solution is to implement weight block sparsity, a structured form of sparsity that is hardware-friendly.
We present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with ResNet50, Inception V3, and VGG16.
arXiv Detail & Related papers (2024-07-12T17:37:49Z) - FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.5114389643299]
This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x training speedup over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
arXiv Detail & Related papers (2024-06-11T00:17:39Z) - Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout model training.
We propose TEASQ-Fed, which lets edge devices participate asynchronously in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Dynamic Sparsity Is Channel-Level Sparsity Learner [91.31071026340746]
Dynamic sparse training (DST) is a leading sparse training approach.
Channel-aware dynamic sparse (Chase) seamlessly translates the promise of unstructured dynamic sparsity into channel-level sparsity.
In effect, the approach converts unstructured sparsity into channel-wise sparsity.
arXiv Detail & Related papers (2023-05-30T23:33:45Z) - LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight
Grouping for Multi-Agent Reinforcement Learning [2.0625936401496237]
Multi-agent reinforcement learning (MARL) is a powerful technique for constructing interactive artificial-intelligence systems.
We present a real-time sparse training acceleration system named LearningGroup.
Our system reduces the cycle time and memory footprint of sparse data generation by up to 5.72x and 6.81x, respectively.
arXiv Detail & Related papers (2022-10-29T15:09:34Z) - PARTIME: Scalable and Parallel Processing Over Time with Deep Neural
Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - Exploiting Activation based Gradient Output Sparsity to Accelerate
Backpropagation in CNNs [15.465530153038927]
Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies.
However, training these models, which involve large numbers of parameters, is both time-consuming and energy-intensive.
arXiv Detail & Related papers (2021-09-16T04:12:51Z) - Training for temporal sparsity in deep neural networks, application in
video processing [0.30458514384586394]
Activation sparsity improves compute efficiency and resource utilization in sparsity-aware neural network accelerators.
We introduce a new layer (called Delta Activation Layer) to promote temporal sparsity of activations during training.
We report an almost 3x improvement in activation sparsity, with a loss of model accuracy that is recoverable with longer training (a minimal sketch of the delta idea appears after this list).
arXiv Detail & Related papers (2021-07-15T13:17:11Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have large numbers of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - ShiftAddNet: A Hardware-Inspired Deep Network [87.18216601210763]
ShiftAddNet is an energy-efficient multiplication-less deep neural network.
It leads to both energy-efficient inference and training, without compromising expressive capacity.
ShiftAddNet aggressively reduces the hardware-quantified energy cost of DNN training and inference by over 80%, while offering comparable or better accuracies (a toy shift-add example follows this list).
arXiv Detail & Related papers (2020-10-24T05:09:14Z)
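
The ShiftAddNet entry above describes a multiplication-less network built from shifts and adds. The toy below is not the paper's learned shift and add layers; the power-of-two weight quantizer and the integer activations are assumptions made for the example. It only illustrates why restricting weights to signed powers of two lets a multiply-accumulate be computed with a bit shift and an addition.

    import numpy as np

    def quantize_to_power_of_two(w):
        """Approximate each weight as sign * 2**exponent (assumed quantizer)."""
        sign = np.sign(w).astype(int)
        exponent = np.round(np.log2(np.abs(w) + 1e-12)).astype(int)
        return sign, exponent

    def shift_add_dot(x_int, signs, exponents):
        """Dot product on integer activations using only shifts and adds.

        The +/- sign selection stands in for the add/subtract choice a
        hardware datapath would make instead of using a multiplier.
        """
        acc = 0
        for xi, s, e in zip(x_int, signs, exponents):
            term = xi << e if e >= 0 else xi >> -e   # shift replaces the multiply
            acc += term if s >= 0 else -term
        return acc

    weights = np.array([0.5, -2.0, 1.0, 4.0])
    x = np.array([3, 1, 2, 1])                       # quantized integer activations
    signs, exps = quantize_to_power_of_two(weights)
    print("shift-add approximation:", shift_add_dot(x, signs, exps))
    print("exact float dot product:", float(np.dot(weights, x)))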
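
For the temporal-sparsity entry above ("Training for temporal sparsity in deep neural networks"), the summary only says that a Delta Activation Layer promotes temporal sparsity of activations; the sketch below shows the generic delta idea it builds on (emit frame-to-frame activation differences so that unchanged regions become zeros). The thresholding and the stateful buffering are illustrative assumptions, not the paper's layer.

    import numpy as np

    class DeltaActivation:
        """Emit frame-to-frame activation differences so static regions become zeros."""

        def __init__(self, threshold=1e-3):
            self.threshold = threshold  # small deltas snapped to exact zeros (assumption)
            self.prev = None

        def __call__(self, activation):
            if self.prev is None:
                delta = activation.copy()          # first frame passes through unchanged
            else:
                delta = activation - self.prev
                delta[np.abs(delta) < self.threshold] = 0.0
            self.prev = activation.copy()
            return delta

    layer = DeltaActivation()
    frames = [np.ones((4, 4)), np.ones((4, 4)), np.ones((4, 4)) + 0.5]
    for f in frames:
        d = layer(f)
        print("fraction of zero activations:", float(np.mean(d == 0.0)))

A sparsity-aware accelerator downstream would skip the zero deltas and accumulate the non-zero ones to recover the dense activations.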
This list is automatically generated from the titles and abstracts of the papers on this site.