CREW: Computation Reuse and Efficient Weight Storage for
Hardware-accelerated MLPs and RNNs
- URL: http://arxiv.org/abs/2107.09408v1
- Date: Tue, 20 Jul 2021 11:10:54 GMT
- Title: CREW: Computation Reuse and Efficient Weight Storage for
Hardware-accelerated MLPs and RNNs
- Authors: Marc Riera, Jose-Maria Arnau, Antonio Gonzalez
- Abstract summary: We present CREW, a hardware accelerator that implements Computation Reuse and an Efficient Weight Storage mechanism.
CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage.
On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator.
- Score: 1.0635248457021496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Neural Networks (DNNs) have achieved tremendous success for cognitive
applications. The core operation in a DNN is the dot product between quantized
inputs and weights. Prior works exploit the weight/input repetition that arises
due to quantization to avoid redundant computations in Convolutional Neural
Networks (CNNs). However, in this paper we show that their effectiveness is
severely limited when applied to Fully-Connected (FC) layers, which are
commonly used in state-of-the-art DNNs, as is the case in modern Recurrent
Neural Networks (RNNs) and Transformer models.
To improve energy-efficiency of FC computation we present CREW, a hardware
accelerator that implements Computation Reuse and an Efficient Weight Storage
mechanism to exploit the large number of repeated weights in FC layers. CREW
first performs the multiplications of the unique weights by their respective
inputs and stores the results in an on-chip buffer. The storage requirements
are modest due to the small number of unique weights and the relatively small
size of the input compared to convolutional layers. Next, CREW computes each
output by fetching and adding its required products. To this end, each weight
is replaced offline by an index in the buffer of unique products. Indices are
typically smaller than the quantized weights, since the number of unique
weights for each input tends to be much lower than the range of quantized
weights, which reduces storage and memory bandwidth requirements.
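A minimal NumPy sketch of the scheme described above, for a single FC layer
(one unique-product buffer per input, then index-based accumulation); the
function name, shapes and sizes are illustrative assumptions, not the paper's
hardware datapath:

    import numpy as np

    def crew_fc_forward(weights_q, x):
        # weights_q: (n_inputs, n_outputs) integer-quantized weight matrix
        # x: (n_inputs,) input activations
        n_inputs, n_outputs = weights_q.shape
        y = np.zeros(n_outputs)
        for i in range(n_inputs):
            # Offline step: per input row, find the unique weights and replace
            # each weight by its index into that small table.
            unique_w, idx = np.unique(weights_q[i], return_inverse=True)
            # Online step 1: one multiplication per unique weight; the results
            # play the role of the on-chip unique-product buffer.
            products = unique_w * x[i]
            # Online step 2: each output fetches its product by index and
            # accumulates it -- additions only, no further multiplications.
            y += products[idx]
        return y

    # Sanity check against a plain dense FC layer (hypothetical sizes):
    w_q = np.random.randint(-8, 8, size=(16, 4))
    x = np.random.rand(16)
    assert np.allclose(crew_fc_forward(w_q, x), w_q.T @ x)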
Overall, CREW greatly reduces the number of multiplications and provides
significant savings in model memory footprint and memory bandwidth usage. We
evaluate CREW on a diverse set of modern DNNs. On average, CREW provides 2.61x
speedup and 2.42x energy savings over a TPU-like accelerator. Compared to UCNN,
a state-of-the-art computation reuse technique, CREW achieves 2.10x speedup and
2.08x energy savings on average.
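As a back-of-the-envelope illustration of the index-versus-weight storage
argument above (all sizes below are assumed for illustration and are not taken
from the paper):

    import math

    # Hypothetical FC layer: 1024x1024, 8-bit quantized weights, and an
    # assumed maximum of 32 unique weight values per input row.
    n_inputs, n_outputs, weight_bits, uniques_per_row = 1024, 1024, 8, 32

    index_bits = math.ceil(math.log2(uniques_per_row))         # 5-bit indices
    baseline_bits = n_inputs * n_outputs * weight_bits          # dense 8-bit weights
    crew_bits = (n_inputs * n_outputs * index_bits              # one index per weight
                 + n_inputs * uniques_per_row * weight_bits)    # unique-weight tables

    print(f"{index_bits}-bit indices vs {weight_bits}-bit weights")
    print(f"footprint: {crew_bits / baseline_bits:.2f}x of the baseline")

With these assumed numbers the indices and unique-weight tables shrink weight
storage to roughly two thirds of the 8-bit baseline, illustrating the kind of
footprint saving the abstract refers to.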
Related papers
- Sorted Weight Sectioning for Energy-Efficient Unstructured Sparse DNNs on Compute-in-Memory Crossbars [4.089232204089156]
Sorted weight sectioning (SWS) is a weight allocation algorithm that places sorted deep neural network (DNN) weight sections on bit-sliced compute-in-memory (CIM) crossbars.
Our method reduces ADC energy use by 89.5% on unstructured sparse BERT models.
arXiv Detail & Related papers (2024-10-15T05:37:16Z)
- Kolmogorov-Arnold Transformer [72.88137795439407]
We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers.
We identify three key challenges: (C1) Base function, (C2) Parameter and computation inefficiency, and (C3) Weight initialization.
With these designs, KAT outperforms traditional MLP-based transformers.
arXiv Detail & Related papers (2024-09-16T17:54:51Z)
- Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation [0.0]
We show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale+ MPSoC ZCU104 FPGA can be at least 1.4x more energy efficient than the uniform quantisation version.
arXiv Detail & Related papers (2022-09-30T06:33:40Z)
- CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks [13.520972975766313]
We propose a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting non-uniform quantization of weights.
We also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights.
arXiv Detail & Related papers (2022-07-31T01:34:56Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update [49.948082497688404]
Training large-scale deep neural networks (DNNs) currently requires a significant amount of energy, leading to serious environmental impacts.
One promising approach to reduce the energy costs is representing DNNs with low-precision numbers.
We jointly design a low-precision training framework involving a logarithmic number system (LNS) and a multiplicative weight update training method, termed LNS-Madam.
arXiv Detail & Related papers (2021-06-26T00:32:17Z)
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, which leads to reliance on external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)
- ShiftAddNet: A Hardware-Inspired Deep Network [87.18216601210763]
ShiftAddNet is an energy-efficient multiplication-less deep neural network.
It leads to both energy-efficient inference and training, without compromising expressive capacity.
ShiftAddNet reduces the hardware-quantified energy cost of DNN training and inference by over 80%, while offering comparable or better accuracy.
arXiv Detail & Related papers (2020-10-24T05:09:14Z)
- BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs [7.635154697466773]
The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy.
We propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs.
arXiv Detail & Related papers (2020-05-20T08:15:33Z)
- SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation [97.78417228445883]
We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all power-of-2 (see the sketch after this list).
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
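The SmartExchange entry above describes storing each layer's weights as a
small dense basis matrix times a large sparse coefficient matrix whose
non-zeros are powers of two. The sketch below only illustrates that structure
with randomly chosen, assumed shapes and values; the actual method learns and
enforces the decomposition during training:

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed shapes: W (256x64) stored as C (256x8, sparse, signed powers of
    # two) times B (8x64, small dense basis).
    rows, cols, rank = 256, 64, 8
    exponents = rng.integers(-3, 2, size=(rows, rank))     # exponents in [-3, 1]
    signs = rng.choice([-1.0, 0.0, 1.0], size=(rows, rank), p=[0.25, 0.5, 0.25])
    C = signs * (2.0 ** exponents)    # non-zero entries are signed powers of two
    B = rng.standard_normal((rank, cols))

    W = C @ B                         # reconstructed layer weight matrix

    x = rng.standard_normal(cols)
    y_dense = W @ x                   # reference dense computation
    y_factored = C @ (B @ x)          # dense multiplies confined to the small basis;
                                      # the C stage needs only shifts and adds in hardware
    assert np.allclose(y_dense, y_factored)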