Term Revealing: Furthering Quantization at Run Time on Quantized DNNs
- URL: http://arxiv.org/abs/2007.06389v2
- Date: Sun, 26 Jul 2020 19:24:51 GMT
- Title: Term Revealing: Furthering Quantization at Run Time on Quantized DNNs
- Authors: H. T. Kung, Bradley McDanel, Sai Qian Zhang
- Abstract summary: We present a novel technique, called Term Revealing (TR), for furthering quantization at run time for improved performance of Deep Neural Networks (DNNs) already quantized with conventional quantization methods.
TR operates on power-of-two terms in binary expressions of values.
We show an FPGA implementation that can use a small number of control bits to switch between conventional quantization and TR-enabled quantization with a negligible delay.
- Score: 9.240133036531402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel technique, called Term Revealing (TR), for furthering
quantization at run time for improved performance of Deep Neural Networks
(DNNs) already quantized with conventional quantization methods. TR operates on
power-of-two terms in binary expressions of values. When computing a dot
product, TR dynamically selects a fixed number of the largest terms to use from
the values of the two vectors in the dot product. By exploiting normal-like
weight and data distributions typically present in DNNs, TR has a minimal
impact on DNN model performance (i.e., accuracy or perplexity). We use TR to
facilitate tightly synchronized processor arrays, such as systolic arrays, for
efficient parallel processing. We show an FPGA implementation that can use a
small number of control bits to switch between conventional quantization and
TR-enabled quantization with a negligible delay. To enhance TR efficiency
further, we use a signed digit representation (SDR), as opposed to classic
binary encoding with only nonnegative power-of-two terms. To perform conversion
from binary to SDR, we develop an efficient encoding method called HESE (Hybrid
Encoding for Signed Expressions) that can be performed in one pass looking at
only two bits at a time. We evaluate TR with HESE encoded values on an MLP for
MNIST, multiple CNNs for ImageNet, and an LSTM for Wikitext-2, and show
significant reductions in inference computations (3x to 10x) compared to
conventional quantization for the same level of model performance.
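
To make the two ideas concrete, here is a minimal Python sketch. It is illustrative only: the standard non-adjacent form (NAF), a signed-digit recoding that likewise scans two bits at a time, stands in for the paper's HESE rules, and the function names and group-of-values interface are ours rather than the authors' API.

```python
def signed_digit_terms(n):
    """Recode a nonnegative integer into signed power-of-two terms
    using the non-adjacent form (NAF), which examines two bits
    (n mod 4) per step. Returns (sign, exponent) pairs,
    e.g. 7 -> [(-1, 0), (1, 3)] since 7 = -1 + 8."""
    terms, exp = [], 0
    while n != 0:
        if n & 1:
            d = 2 - (n % 4)            # d in {-1, +1}
            terms.append((d, exp))
            n -= d
        n >>= 1
        exp += 1
    return terms

def term_reveal(values, budget):
    """Keep only the `budget` largest-magnitude terms across a group
    of quantized integer values; the remaining terms are truncated."""
    all_terms = []
    for i, v in enumerate(values):
        sign = -1 if v < 0 else 1
        for s, e in signed_digit_terms(abs(v)):
            all_terms.append((e, sign * s, i))
    all_terms.sort(key=lambda t: t[0], reverse=True)  # largest exponents first
    out = [0] * len(values)
    for e, s, i in all_terms[:budget]:
        out[i] += s * (1 << e)
    return out

# Example: a group of four values truncated to six terms in total.
print(term_reveal([7, -12, 3, 96], budget=6))  # -> [8, -12, 4, 96]
```

A TR dot product would then multiply only the surviving terms, which is what bounds the per-cell work in a tightly synchronized array such as a systolic array.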
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings.
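The recurrent view is simple to sketch: causal attention over a prefix can be carried as a three-component state (running max score, exp-weighted value sum, exp-weight sum) whose update is associative. Below is a minimal NumPy sketch of the many-to-one case, with names of our choosing; because the combine step is associative, the many-to-many outputs can be computed with a parallel prefix scan, as the paper proposes.

```python
import numpy as np

def attention_as_rnn(q, keys, values):
    """Many-to-one attention computed recurrently, one token at a time.
    State (m, u, w) = (running max score, exp-weighted value sum,
    exp-weight sum); the update is associative, which is what allows
    a parallel prefix scan in the many-to-many case."""
    d = q.shape[0]
    m, u, w = -np.inf, np.zeros_like(values[0]), 0.0
    for k_t, v_t in zip(keys, values):
        s = k_t @ q / np.sqrt(d)       # score of this token
        m_new = max(m, s)
        scale = np.exp(m - m_new)      # rescale the old state
        u = u * scale + np.exp(s - m_new) * v_t
        w = w * scale + np.exp(s - m_new)
        m = m_new
    return u / w                       # softmax-weighted value

# Quick check against the standard softmax formulation.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=3), rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
a = np.exp(K @ q / np.sqrt(3)); a /= a.sum()
assert np.allclose(attention_as_rnn(q, K, V), a @ V)
```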
arXiv Detail & Related papers (2024-05-22T19:45:01Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
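The TC stream's core move is standard enough to sketch: a continuous wavelet transform turns a 1-D signal into a 2-D scale-by-time tensor that convolutional layers can consume. A hedged illustration with PyWavelets follows; the wavelet choice, scales, and preprocessing here are assumptions, not the paper's settings.

```python
import numpy as np
import pywt  # PyWavelets

signal = np.random.randn(256)        # stand-in 1-D behavioral feature signal
scales = np.arange(1, 65)            # 64 scale (frequency) bins, illustrative
coefs, freqs = pywt.cwt(signal, scales, "morl")  # Morlet wavelet, an assumption
tensor_2d = np.abs(coefs)            # shape (64, 256): scales x time
```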
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - BiPer: Binary Neural Networks using a Periodic Function [17.461853355858022]
Quantized neural networks employ reduced precision representations for both weights and activations.
Binary Neural Networks (BNNs) are the extreme quantization case, representing values with just one bit.
In contrast to current BNN approaches, we propose to employ a binary periodic (BiPer) function during binarization.
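Our reading of this idea, sketched below with PyTorch (the paper's exact formulation may differ): a square wave sign(sin(omega * w)) binarizes weights in the forward pass, while the sine of the same period serves as the differentiable surrogate in the backward pass.

```python
import torch

class BiPerBinarize(torch.autograd.Function):
    """Square-wave forward, sine-derivative surrogate backward (a sketch)."""
    @staticmethod
    def forward(ctx, w, omega=1.0):
        ctx.save_for_backward(w)
        ctx.omega = omega
        # sign(sin(.)) is 0 only at exact zero crossings, rare for random floats
        return torch.sign(torch.sin(omega * w))

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # d/dw sin(omega * w) = omega * cos(omega * w)
        return grad_out * ctx.omega * torch.cos(ctx.omega * w), None

w = torch.randn(8, requires_grad=True)
b = BiPerBinarize.apply(w, 1.0)   # binary weights in {-1, +1}
b.sum().backward()                # gradients flow through the sine surrogate
```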
arXiv Detail & Related papers (2024-04-01T17:52:17Z) - Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed.
We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords.
Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z) - ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural
Network Quantization [31.494669469303954]
We propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads.
Our design results in 2.8× speedup and 2.5× energy efficiency improvement over the state-of-the-art quantization accelerators.
arXiv Detail & Related papers (2022-08-30T14:12:49Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
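One concrete way such a decomposition can work (a sketch under our own conventions, not necessarily the paper's exact scheme): an M-bit quantizer with odd levels in [-(2^M - 1), 2^M - 1] splits into M binary branches b_i in {-1, +1} with q = sum_i 2^i * b_i, so each branch admits XNOR/popcount arithmetic.

```python
import numpy as np

def decompose_pm1(q, bits):
    """Split integers q (odd values in [-(2**bits - 1), 2**bits - 1])
    into `bits` branches b_i in {-1, +1} with q = sum_i 2**i * b_i."""
    u = (q + (2**bits - 1)) // 2          # map to unsigned 0 .. 2**bits - 1
    return [2 * ((u >> i) & 1) - 1 for i in range(bits)]  # 0/1 -> -1/+1

q = np.array([-7, -1, 3, 7])              # 3-bit odd-level quantized values
branches = decompose_pm1(q, bits=3)
recon = sum((2**i) * b for i, b in enumerate(branches))
assert np.array_equal(recon, q)           # exact reconstruction
```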
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, the computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z) - Compressing deep neural networks on FPGAs to binary and ternary
precision with HLS4ML [13.325670094073383]
We present the implementation of binary and ternary neural networks in the hls4ml library.
We discuss the trade-off between model accuracy and resource consumption.
The binary and ternary implementations achieve performance similar to the higher-precision implementation while using drastically fewer FPGA resources.
arXiv Detail & Related papers (2020-03-11T10:46:51Z)