SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit
Sparsity of Neural Network
- URL: http://arxiv.org/abs/2103.01705v1
- Date: Tue, 2 Mar 2021 13:27:15 GMT
- Title: SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit
Sparsity of Neural Network
- Authors: Fangxin Liu, Wenbo Zhao, Yilong Zhao, Zongwu Wang, Tao Yang, Zhezhi
He, Naifeng Jing, Xiaoyao Liang, Li Jiang
- Abstract summary: We develop a novel ReRAM-based deep neural network (DNN) accelerator, named Sparse-Multiplication-Engine (SME)
First, we orchestrate the bit-sparse pattern to increase the density of bit-sparsity based on existing quantization methods.
Second, we propose a novel weight mapping mechanism to slice the bits of a weight across the crossbars and splice the activation results in peripheral circuits.
Third, a squeeze-out scheme frees the crossbars that are left holding only a few non-zero bits after the previous two steps.
- Score: 18.79036546647254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Resistive Random-Access-Memory (ReRAM) crossbar is a promising technique for
deep neural network (DNN) accelerators, thanks to its in-memory and in-situ
analog computing abilities for Vector-Matrix Multiplication-and-Accumulations
(VMMs). However, it is challenging for the crossbar architecture to exploit
sparsity in the DNN: the tightly-coupled crossbar structure inevitably requires
complex and costly control logic to exploit fine-grained sparsity. As a
countermeasure, we develop a novel ReRAM-based DNN accelerator, named
Sparse-Multiplication-Engine (SME), based on a hardware and
software co-design framework. First, we orchestrate the bit-sparse pattern to
increase the density of bit-sparsity based on existing quantization methods.
Second, we propose a novel weight mapping mechanism to slice the bits of a
weight across the crossbars and splice the activation results in peripheral
circuits. This mechanism decouples the tightly-coupled crossbar structure
and accumulates sparsity in the crossbars. Finally, a squeeze-out scheme
frees the crossbars whose mapped slices are left with only a few non-zeros
after the previous two steps. We design the SME architecture and discuss its use for
other quantization methods and different ReRAM cell technologies. Compared with
prior state-of-the-art designs, the SME reduces crossbar usage by up to 8.7x
and 2.1x on ResNet-50 and MobileNet-v2, respectively, with less than 0.3%
accuracy drop on ImageNet.
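
For intuition, here is a minimal NumPy sketch of the bit-slicing idea the abstract describes: a quantized weight tile is sliced into one 0/1 plane per bit position, each plane would occupy its own crossbar, and planes that are almost entirely zero can be squeezed out. The 4-bit weights, 8x8 tile, and 5% density threshold are illustrative assumptions, not the SME design.

```python
import numpy as np

def bit_planes(weights_q, n_bits=4):
    """Slice unsigned n_bits-quantized weights into per-bit 0/1 planes,
    one plane per bit position (LSB first)."""
    return [(weights_q >> b) & 1 for b in range(n_bits)]

def squeeze_out(planes, density_threshold=0.05):
    """Keep only bit planes whose non-zero density exceeds the threshold;
    a plane that is (almost) all zero needs no crossbar at all."""
    kept, skipped = [], []
    for b, plane in enumerate(planes):
        density = plane.mean()
        (kept if density > density_threshold else skipped).append(b)
    return kept, skipped

rng = np.random.default_rng(0)
# Toy 8x8 weight tile quantized to 4 bits; most weights are small,
# so the high-order bit planes end up highly sparse.
w = np.clip(rng.geometric(0.6, size=(8, 8)) - 1, 0, 15).astype(np.uint8)
planes = bit_planes(w, n_bits=4)
kept, skipped = squeeze_out(planes)
print("bit planes kept on crossbars:", kept)
print("bit planes squeezed out:     ", skipped)

# Splicing check: shifting and adding all bit planes reconstructs the
# original weights exactly.
recon = sum((p.astype(int) << b) for b, p in enumerate(planes))
assert np.array_equal(recon, w)
```

The final check mirrors the splicing step: shifting and summing the per-plane results in peripheral circuits recovers the original weights, so dropping a truly all-zero plane is free; planes that are merely near-zero are what the paper's squeeze-out scheme targets.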
Related papers
- RNC: Efficient RRAM-aware NAS and Compilation for DNNs on Resource-Constrained Edge Devices [0.30458577208819987]
We aim to develop edge-friendly deep neural networks (DNNs) for accelerators based on resistive random-access memory (RRAM).
We propose an edge compilation and resource-constrained RRAM-aware neural architecture search (NAS) framework to search for optimized neural networks meeting specific hardware constraints.
The model produced by the speed-optimized NAS achieved a 5x-30x speedup.
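
As a flavor of what hardware-constrained search means here, the toy sketch below uses random search (a stand-in for the paper's NAS) with a made-up crossbar cost model; the XBAR size, width choices, and budget are all hypothetical.

```python
import math, random

XBAR = 128  # assumed 128x128 crossbar tile; purely illustrative

def crossbars_needed(widths):
    """Toy cost model: each layer's weight matrix is tiled into
    XBAR x XBAR crossbars."""
    return sum(math.ceil(i / XBAR) * math.ceil(o / XBAR)
               for i, o in zip(widths, widths[1:]))

def search(budget, trials=200, seed=0):
    """Random-search stand-in for NAS: sample layer widths and keep the
    highest-capacity network that fits the crossbar budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        widths = [rng.choice([64, 128, 256, 512])
                  for _ in range(rng.randint(3, 6))]
        cost = crossbars_needed(widths)
        if cost <= budget and (best is None or sum(widths) > sum(best[0])):
            best = (widths, cost)
    return best

widths, cost = search(budget=24)
print(f"chosen widths {widths} use {cost} crossbars (budget 24)")
```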
arXiv Detail & Related papers (2024-09-27T15:35:36Z)
- BasisN: Reprogramming-Free RRAM-Based In-Memory-Computing by Basis Combination for Deep Neural Networks [9.170451418330696]
We propose the BasisN framework to accelerate deep neural networks (DNNs) on any number of crossbars without reprogramming.
We show that BasisN reduces cycles per inference and the energy-delay product to below 1% of those of designs that reprogram the crossbars.
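
A minimal sketch of the reprogramming-free idea, under the assumption that it amounts to expressing each layer as a linear combination of crossbar-resident basis matrices; the basis size, shapes, and least-squares fit below are illustrative, not BasisN's training procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed random basis programmed onto crossbars once; every layer is
# then a linear combination of these matrices, so deploying a new layer
# needs new digital coefficients but no crossbar reprogramming.
K, shape = 256, (16, 16)
basis = rng.standard_normal((K, *shape))
B = basis.reshape(K, -1).T                 # flattened basis, (256, K)

def fit_coefficients(W):
    """Least-squares c with W ~ sum_k c[k] * basis[k]."""
    c, *_ = np.linalg.lstsq(B, W.ravel(), rcond=None)
    return c

def apply_layer(x, c):
    """Compute x @ W_hat as sum_k c[k] * (x @ basis[k]): the crossbars
    hold the basis; the scaling and summing happen digitally."""
    return sum(ck * (x @ bk) for ck, bk in zip(c, basis))

W = rng.standard_normal(shape)
c = fit_coefficients(W)
x = rng.standard_normal((4, 16))
err = np.linalg.norm(x @ W - apply_layer(x, c)) / np.linalg.norm(x @ W)
print(f"relative error: {err:.2e}")   # ~0 here; a smaller shared basis
                                      # would trade accuracy for fewer cells
```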
arXiv Detail & Related papers (2024-07-04T08:47:05Z)
- MST-compression: Compressing and Accelerating Binary Neural Networks with Minimum Spanning Tree [21.15961593182111]
Binary neural networks (BNNs) have been widely adopted to reduce the computational cost and memory storage on edge-computing devices.
However, as neural networks become wider/deeper to improve accuracy and meet practical requirements, the computational burden remains a significant challenge even on the binary version.
This paper proposes a novel method called Minimum Spanning Tree (MST) compression that learns to compress and accelerate BNNs.
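
The summary does not spell out the mechanism, so the sketch below is one plausible reading rather than the paper's algorithm: binary weight rows that differ in few positions can reuse each other's partial results, and a minimum spanning tree over pairwise Hamming distances minimizes the total correction work. Row sizes and the Prim's-algorithm formulation are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
rows = rng.integers(0, 2, size=(6, 16))   # binary weight rows (0/1)

def hamming(a, b):
    return int(np.sum(a != b))

# Prim's algorithm over pairwise Hamming distances. Computing row i's
# dot product as a correction to its MST parent touches only the
# differing positions, so the total work is one full row plus the sum
# of the tree's edge weights.
n = len(rows)
in_tree, parent = {0}, {0: None}
best = {i: (hamming(rows[0], rows[i]), 0) for i in range(1, n)}
total = 0
while len(in_tree) < n:
    i = min(best, key=lambda j: best[j][0])
    d, p = best.pop(i)
    in_tree.add(i); parent[i] = p; total += d
    for j in best:
        dj = hamming(rows[i], rows[j])
        if dj < best[j][0]:
            best[j] = (dj, i)

naive = rows.size   # positions touched without any reuse
print(f"MST reuse cost: {rows.shape[1] + total} positions vs {naive} naive")
print("parents:", parent)
```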
arXiv Detail & Related papers (2023-08-26T02:42:12Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions down to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
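
The Dense-and-Sparse decomposition in idea (ii) can be sketched directly; the outlier fraction, uniform 3-bit quantizer, and magnitude-based outlier rule below are simplifications (SqueezeLLM's levels are non-uniform and sensitivity-based).

```python
import numpy as np

def dense_and_sparse(W, outlier_pct=0.5, n_bits=3):
    """Split W into a low-bit dense part plus a sparse full-precision
    part holding the largest-magnitude entries."""
    cut = np.percentile(np.abs(W), 100 - outlier_pct)
    outlier = np.abs(W) >= cut
    dense = np.where(outlier, 0.0, W)
    scale = np.abs(dense).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(dense / scale).astype(np.int8)        # low-bit codes
    return q, scale, np.argwhere(outlier), W[outlier]  # sparse part in FP

def reconstruct(q, scale, idx, vals):
    W_hat = q.astype(np.float64) * scale
    W_hat[tuple(idx.T)] += vals    # add outliers back at full precision
    return W_hat

rng = np.random.default_rng(3)
# Mostly well-behaved weights plus a few large outliers.
W = rng.standard_normal((64, 64)) + 8.0 * (rng.random((64, 64)) < 0.003)
q, scale, idx, vals = dense_and_sparse(W)
err = np.abs(W - reconstruct(q, scale, idx, vals)).max()
print(f"{len(vals)} outliers stored sparsely; max error {err:.4f}")
```

Pulling the outliers into the sparse part keeps the dense quantization range tight, which is why the low-bit codes stay accurate.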
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights by a small amount proportional to the magnitude scale on-the-fly.
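
A one-step sketch of the soft-shrinkage idea, with an illustrative percentage and shrink factor rather than the ISS-P schedule: unimportant weights are scaled down instead of hard-zeroed, so they can still recover in later iterations.

```python
import numpy as np

def soft_shrink_step(W, prune_pct=30.0, shrink=0.1):
    """One soft-shrinkage iteration: scale the smallest prune_pct
    percent of weights (by magnitude) down by `shrink` instead of
    zeroing them, leaving a path for them to grow back in training."""
    cut = np.percentile(np.abs(W), prune_pct)
    small = np.abs(W) <= cut
    return np.where(small, shrink * W, W)

rng = np.random.default_rng(4)
W = rng.standard_normal((4, 4))
for step in range(3):                  # repeated steps drive the small
    W = soft_shrink_step(W)            # weights toward (but not to) zero
print(W.round(3))
```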
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
- BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance [54.214426436283134]
Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications.
We present a strong yet efficient binary neural network for KWS, namely BiFSMNv2, pushing it to real-network accuracy.
We highlight that benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage-saving on edge hardware.
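
The arithmetic that makes BNNs cheap on edge hardware is worth seeing once: a dot product of ±1 vectors becomes XNOR plus popcount on packed bits. This is generic BNN arithmetic, not BiFSMNv2's optimized kernel.

```python
def pack(bits):
    """Pack a list of +/-1 values into an int: +1 -> 1, -1 -> 0."""
    word = 0
    for i, b in enumerate(bits):
        if b == 1:
            word |= 1 << i
    return word

def binary_dot(wa, wb, n):
    """Dot product of two packed +/-1 vectors:
    matches = popcount(~(wa ^ wb)) over n bits; dot = 2*matches - n."""
    matches = bin(~(wa ^ wb) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
assert binary_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
print(binary_dot(pack(a), pack(b), len(a)))
```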
arXiv Detail & Related papers (2022-11-13T18:31:45Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods can hardly deliver real inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
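
"Weight unification" is not defined in the summary; the sketch below is a guess at its flavor, under the assumption that weights inside a small hardware-friendly block are forced to share one magnitude so each block stores a single scalar plus sign bits. The block size and unification rule are hypothetical.

```python
import numpy as np

def unify_blocks(W, block=4):
    """Force each 1 x `block` micro-block to share a single magnitude
    (its mean absolute value), keeping per-weight signs. Each block then
    stores one scalar plus `block` sign bits."""
    rows, cols = W.shape
    out = W.copy()
    for r in range(rows):
        for c in range(0, cols, block):
            blk = W[r, c:c + block]
            out[r, c:c + block] = np.sign(blk) * np.abs(blk).mean()
    return out

rng = np.random.default_rng(5)
W = rng.standard_normal((2, 8))
print(unify_blocks(W).round(3))   # one magnitude per 4-weight block
```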
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
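
The N:M pattern itself is easy to pin down: keep the N largest-magnitude weights in every group of M. A minimal sketch for the common 2:4 pattern follows (the training-from-scratch contribution of the paper is not shown).

```python
import numpy as np

def prune_n_m(W, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m along
    the last axis (the 2:4 pattern used by sparse tensor cores)."""
    groups = W.reshape(-1, m)
    order = np.argsort(np.abs(groups), axis=1)
    mask = np.ones_like(groups, dtype=bool)
    # Zero out the m - n smallest-magnitude entries of each group.
    np.put_along_axis(mask, order[:, :m - n], False, axis=1)
    return (groups * mask).reshape(W.shape)

rng = np.random.default_rng(6)
W = rng.standard_normal((2, 8))
Ws = prune_n_m(W)
print(Ws.round(2))                 # exactly 2 non-zeros per group of 4
assert ((Ws.reshape(-1, 4) != 0).sum(axis=1) == 2).all()
```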
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
- Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
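
The low index overhead of structural sparsity can be made concrete: a 2:4-sparse row compresses to its non-zero values plus 2-bit in-group indices, and the dot product gathers activations through those indices. The format below is a generic illustration, not this paper's systolic design.

```python
import numpy as np

def compress_2_4(row):
    """Compress a 2:4-sparse row into (values, 2-bit indices) per group.
    The predictable 2-of-4 structure is what keeps load balancing and
    index overhead low in hardware."""
    vals, idxs = [], []
    for g in range(0, len(row), 4):
        nz = np.flatnonzero(row[g:g + 4])
        assert len(nz) == 2, "row must be 2:4 sparse"
        vals.extend(row[g + nz]); idxs.extend(nz)
    return np.array(vals), np.array(idxs, dtype=np.uint8)

def sparse_dot(vals, idxs, x):
    """y = w . x using only the stored non-zeros: each value multiplies
    the activation selected by its 2-bit in-group index."""
    groups = np.repeat(np.arange(len(vals) // 2), 2)
    return float(np.sum(vals * x[groups * 4 + idxs]))

w = np.array([0., 3., 0., -1., 2., 0., 0., 5.])
x = np.arange(8.0)
vals, idxs = compress_2_4(w)
assert sparse_dot(vals, idxs, x) == float(w @ x)
print(sparse_dot(vals, idxs, x))
```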
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
- Hardware-Centric AutoML for Mixed-Precision Quantization [34.39845532939529]
Conventional quantization algorithms ignore differences between hardware architectures and quantize all layers in a uniform way.
In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy.
Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization.
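
The knob a HAQ-style policy controls is the per-layer bitwidth of an ordinary quantizer; the sketch below shows that quantizer with a hard-coded policy dict standing in for the RL agent (layer names and bitwidths are made up).

```python
import numpy as np

def quantize(w, n_bits):
    """Symmetric uniform quantization to n_bits: the per-layer knob a
    HAQ-style policy would set (the RL search itself is omitted)."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(7)
layers = {"conv1": rng.standard_normal(256),
          "conv2": rng.standard_normal(256),
          "fc":    rng.standard_normal(256)}
policy = {"conv1": 8, "conv2": 4, "fc": 6}   # a mixed-precision policy

for name, w in layers.items():
    err = np.abs(w - quantize(w, policy[name])).max()
    print(f"{name}: {policy[name]}-bit, max error {err:.4f}")
```

The printed errors show the accuracy/bitwidth trade-off the policy navigates: halving a layer's bitwidth roughly doubles its quantization error per bit removed.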
arXiv Detail & Related papers (2020-08-11T17:30:22Z)
- When Residual Learning Meets Dense Aggregation: Rethinking the Aggregation of Deep Neural Networks [57.0502745301132]
We propose Micro-Dense Nets, a novel architecture with global residual learning and local micro-dense aggregations.
Our micro-dense block can be integrated with neural architecture search based models to boost their performance.
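
A generic NumPy sketch of combining the two ingredients named in the title, global residual learning around local dense aggregation; the depths, widths, and projection are arbitrary, and the topology is not claimed to match Micro-Dense Nets exactly.

```python
import numpy as np

rng = np.random.default_rng(8)
relu = lambda z: np.maximum(z, 0)

def micro_dense_block(x, weights):
    """Local dense aggregation: each layer consumes the concatenation of
    the block input and all previous layers' features (DenseNet-style)."""
    feats = [x]
    for W in weights:
        feats.append(relu(np.concatenate(feats, axis=-1) @ W))
    return np.concatenate(feats[1:], axis=-1)

d, growth, n_layers = 16, 8, 3
weights = [rng.standard_normal((d + i * growth, growth)) * 0.1
           for i in range(n_layers)]
proj = rng.standard_normal((n_layers * growth, d)) * 0.1  # match dims

x = rng.standard_normal((2, d))
y = x + micro_dense_block(x, weights) @ proj   # global residual learning
print(y.shape)
```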
arXiv Detail & Related papers (2020-04-19T08:34:52Z)