Convolutional Neural Networks Quantization with Attention
- URL: http://arxiv.org/abs/2209.15317v1
- Date: Fri, 30 Sep 2022 08:48:31 GMT
- Title: Convolutional Neural Networks Quantization with Attention
- Authors: Binyi Wu, Bernd Waschneck, Christian Georg Mayr
- Abstract summary: We propose a method, double-stage Squeeze-and-Threshold (double-stage ST).
It uses the attention mechanism to quantize networks and achieve state-of-the-art results.
- Score: 1.0312968200748118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been proven that, compared to using 32-bit floating-point numbers in
the training phase, Deep Convolutional Neural Networks (DCNNs) can operate with
low precision during inference, thereby saving memory space and power
consumption. However, quantizing networks is always accompanied by an accuracy
decrease. Here, we propose a method, double-stage Squeeze-and-Threshold
(double-stage ST). It uses the attention mechanism to quantize networks and
achieve state-of-the-art results. Using our method, the 3-bit model can achieve
accuracy that exceeds the accuracy of the full-precision baseline model. The
proposed double-stage ST activation quantization is easy to apply: it is inserted
before the convolution.
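As a rough illustration of what "inserting activation quantization before the convolution" can look like, the sketch below applies a generic uniform 3-bit quantizer to a few activations and then runs a toy 1-D convolution on the result. This is not the double-stage ST method, whose details the abstract does not give; the function names, the clipping range [0, 1], and the kernel values are assumptions chosen only for illustration.

```python
def quantize_activations(values, bits=3, max_value=1.0):
    """Uniformly quantize activations to 2**bits levels in [0, max_value].

    Generic uniform quantizer (an assumption for illustration),
    not the attention-based double-stage ST quantizer from the paper.
    """
    levels = 2 ** bits - 1            # 3 bits -> 7 steps, 8 representable values
    step = max_value / levels
    quantized = []
    for v in values:
        clipped = min(max(v, 0.0), max_value)   # clip to the representable range
        quantized.append(round(clipped / step) * step)
    return quantized


def conv1d(signal, kernel):
    """Plain 'valid' 1-D convolution (cross-correlation, as in most DL frameworks)."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]


# Hypothetical activations and kernel, chosen only for the example.
activations = [-0.2, 0.1, 0.45, 0.8, 1.3]
q = quantize_activations(activations, bits=3)   # quantizer sits before the convolution
out = conv1d(q, [0.5, -0.25, 0.25])
```

With 3 bits, every quantized activation is one of at most eight values, so the convolution only ever sees low-precision inputs.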
Related papers
- Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search [7.392278887917975]
Mixed-precision quantization allows different tensors to be quantized to varying levels of numerical precision.
We evaluate our method for computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31%.
arXiv Detail & Related papers (2023-02-02T19:30:00Z)
- Automatic Network Adaptation for Ultra-Low Uniform-Precision Quantization [6.1664476076961146]
Uniform-precision neural network quantization has gained popularity since it simplifies densely packed arithmetic units for high computing capability.
It ignores heterogeneous sensitivity to the impact of quantization errors across the layers, resulting in sub-optimal inference.
This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization.
arXiv Detail & Related papers (2022-12-21T09:41:25Z)
- Standard Deviation-Based Quantization for Deep Neural Networks [17.495852096822894]
Quantization of deep neural networks is a promising approach that reduces the inference cost.
We propose a new framework to learn the quantization intervals (discrete values) using the knowledge of the network's weight and activation distributions.
Our scheme simultaneously prunes the network's parameters and allows us to flexibly adjust the pruning ratio during the quantization process.
arXiv Detail & Related papers (2022-02-24T23:33:47Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware.
PoT quantization triggers a severe accuracy drop because of its limited representation ability.
We propose an efficient PoT quantization scheme that balances accuracy and costs in a memory-efficient way.
arXiv Detail & Related papers (2021-03-22T10:13:12Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and use a differential method to search them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
- WrapNet: Neural Net Inference with Ultra-Low-Resolution Arithmetic [57.07483440807549]
We propose a method that adapts neural networks to use low-resolution (8-bit) additions in the accumulators, achieving classification accuracy comparable to their 32-bit counterparts.
We demonstrate the efficacy of our approach on both software and hardware platforms.
arXiv Detail & Related papers (2020-07-26T23:18:38Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
- WaveQ: Gradient-Based Deep Quantization of Neural Networks through Sinusoidal Adaptive Regularization [8.153944203144988]
We propose a novel sinusoidal regularization, called SINAREQ, for deep quantized training.
We show how SINAREQ balances compute efficiency and accuracy, and provides a heterogeneous bitwidth assignment for quantization of a large variety of deep networks.
arXiv Detail & Related papers (2020-02-29T01:19:55Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantization neural networks (QNNs) are very attractive to the industry because of their extremely cheap calculation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)