Quantization of Deep Neural Networks for Accumulator-constrained
Processors
- URL: http://arxiv.org/abs/2004.11783v1
- Date: Fri, 24 Apr 2020 14:47:14 GMT
- Title: Quantization of Deep Neural Networks for Accumulator-constrained
Processors
- Authors: Barry de Bruin, Zoran Zivkovic, Henk Corporaal
- Abstract summary: We introduce an Artificial Neural Network (ANN) quantization methodology for platforms without wide accumulation registers.
We formulate the quantization problem as a function of accumulator size, and maximize model accuracy by maximizing the bit widths of input data and weights.
We demonstrate that 16-bit accumulators achieve classification accuracy within 1% of the floating-point baselines on the CIFAR-10 and ILSVRC2012 image classification benchmarks.
- Score: 2.8489574654566674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an Artificial Neural Network (ANN) quantization
methodology for platforms without wide accumulation registers. This enables
fixed-point model deployment on embedded compute platforms that are not
specifically designed for large kernel computations (i.e.
accumulator-constrained processors). We formulate the quantization problem as
a function of accumulator size, and aim to maximize model accuracy by
maximizing the bit widths of input data and weights. To reduce the number of
configurations to consider, only solutions that fully utilize the available
accumulator bits are tested. We demonstrate that 16-bit accumulators obtain
classification accuracy within 1% of the floating-point baselines on the
CIFAR-10 and ILSVRC2012 image classification benchmarks. Additionally, a
near-optimal 2x speedup is obtained on an ARM processor by exploiting 16-bit
accumulators for image classification on the All-CNN-C and AlexNet networks.
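
The configuration pruning described above can be made concrete with worst-case fixed-point arithmetic. Below is a minimal sketch, assuming unsigned activations, signed weights, and an illustrative 576-term dot product (a 3x3 kernel over 64 channels); it enumerates the (activation, weight) bit-width pairs that fit a given accumulator and cannot be widened further, i.e. the solutions that fully utilize the available accumulator bits. This is not the paper's exact procedure, only the underlying bound:

```python
def accumulator_bits_needed(a_bits: int, w_bits: int, n_terms: int) -> int:
    """Worst-case signed accumulator width for a dot product of n_terms
    products of unsigned a_bits activations and signed w_bits weights."""
    max_act = (1 << a_bits) - 1           # largest unsigned activation
    max_wgt = 1 << (w_bits - 1)           # largest signed weight magnitude
    worst = n_terms * max_act * max_wgt   # largest possible |sum|
    return worst.bit_length() + 1         # +1 for the sign bit

def feasible_configs(acc_bits: int, n_terms: int):
    """Yield (a_bits, w_bits) pairs that fit in acc_bits and fully
    utilize it: adding one bit to either operand would overflow."""
    for a in range(1, acc_bits):
        for w in range(1, acc_bits):
            if (accumulator_bits_needed(a, w, n_terms) <= acc_bits
                    and accumulator_bits_needed(a + 1, w, n_terms) > acc_bits
                    and accumulator_bits_needed(a, w + 1, n_terms) > acc_bits):
                yield a, w

# Example: a 3x3 convolution over 64 input channels (576-term dot product)
for a, w in feasible_configs(acc_bits=16, n_terms=576):
    print(f"activations: {a} bit, weights: {w} bit")
```

For a 16-bit accumulator this yields only a small frontier of bit-width pairs, which is what makes an exhaustive accuracy evaluation over the remaining configurations tractable.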
Related papers
- A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance [49.1574468325115]
Accumulator-aware quantization (A2Q) is a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow during inference.
A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds.
We show A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline.
arXiv Detail & Related papers (2023-08-25T17:28:58Z)
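
The overflow guarantee in A2Q rests on a worst-case bound: with unsigned a-bit inputs, |sum_i w_i * x_i| <= ||w||_1 * (2^a - 1), so a signed P-bit accumulator cannot overflow if ||w||_1 <= (2^(P-1) - 1) / (2^a - 1). Below is a minimal sketch that enforces such a bound by rescaling one output channel's weights; A2Q itself folds the constraint into a weight-normalization parameterization trained end to end, so this function is only an illustration:

```python
import numpy as np

def a2q_style_clip(w: np.ndarray, acc_bits: int, act_bits: int) -> np.ndarray:
    """Rescale a weight vector so its L1 norm satisfies the worst-case
    no-overflow bound for one output channel. With unsigned act_bits
    inputs, |sum(w * x)| <= ||w||_1 * (2^a - 1), so overflow is
    impossible if ||w||_1 <= (2^(P-1) - 1) / (2^a - 1)."""
    bound = (2 ** (acc_bits - 1) - 1) / (2 ** act_bits - 1)
    l1 = np.abs(w).sum()
    return w if l1 <= bound else w * (bound / l1)

# Example: 16-bit accumulator, 8-bit unsigned activations
w = np.random.randn(512).astype(np.float32)
w_safe = a2q_style_clip(w, acc_bits=16, act_bits=8)
print(np.abs(w_safe).sum(), "<=", (2**15 - 1) / (2**8 - 1))
```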
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
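
At ultra low precision the entire multiplication table fits in a handful of entries, which is the observation DeepGEMM builds on. Below is a scalar Python sketch assuming 2-bit signed operands; the actual kernels pack these lookups into SIMD shuffle instructions on x86, which this illustration does not attempt:

```python
import itertools
import numpy as np

# Precompute all products of 2-bit signed operands once; at this
# precision the table has only 16 entries, so multiplies in the
# inner loop become table lookups.
LEVELS = [-2, -1, 0, 1]  # 2-bit signed values
LUT = {(a, w): a * w for a, w in itertools.product(LEVELS, LEVELS)}

def lut_dot(acts, wgts):
    """Dot product of 2-bit vectors using lookups instead of multiplies."""
    return sum(LUT[(a, w)] for a, w in zip(acts, wgts))

acts = np.random.choice(LEVELS, size=64)
wgts = np.random.choice(LEVELS, size=64)
assert lut_dot(acts, wgts) == int(np.dot(acts, wgts))
```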
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and hardware deployment on FPGA validate the great potential of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z)
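
The sub-bit idea follows from simple counting: a binary 3x3 kernel has 2^9 = 512 possible patterns, so restricting kernels to a codebook of K < 512 entries costs log2(K)/9 < 1 bit per weight. The sketch below uses a random codebook and nearest-codeword assignment purely for illustration; SNNs instead learn the codebook through their kernel-aware optimization framework:

```python
import numpy as np

def compress_binary_kernels(kernels: np.ndarray, codebook: np.ndarray):
    """Map each flattened 3x3 binary (+-1) kernel to the index of the
    nearest codeword (minimum Hamming distance). With K codewords each
    kernel costs log2(K) bits instead of 9, i.e. sub-bit per weight."""
    # Hamming distance between +-1 vectors: (9 - dot) / 2,
    # so maximizing the dot product minimizes the distance.
    dots = kernels @ codebook.T        # (n_kernels, K)
    return np.argmax(dots, axis=1)     # nearest codeword index

rng = np.random.default_rng(0)
kernels = rng.choice([-1, 1], size=(128, 9))  # 128 binary 3x3 kernels
codebook = rng.choice([-1, 1], size=(32, 9))  # K=32 -> 5/9 bit per weight
idx = compress_binary_kernels(kernels, codebook)
print(f"{np.log2(len(codebook)) / 9:.2f} bits per weight")  # ~0.56
```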
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
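
The decomposition can be checked numerically. Writing an m-bit operand as q = sum_i 2^i * b_i with b_i in {-1, +1} (which covers the odd integers in [-(2^m - 1), 2^m - 1]), a quantized dot product becomes a sum of scaled binary dot products, each of which maps onto XNOR/popcount kernels. A minimal sketch; the paper's exact encoding and branch structure may differ:

```python
import numpy as np

def decompose(q: np.ndarray, m: int):
    """Decompose odd integers in [-(2^m - 1), 2^m - 1] into m
    {-1, +1} planes such that q = sum_i 2^i * b_i."""
    v = (q + (1 << m) - 1) // 2                  # shift into [0, 2^m - 1]
    return [((v >> i) & 1) * 2 - 1 for i in range(m)]

m = 2
q_w = np.array([-3, -1, 1, 3, 3, -1])            # 2-bit {-1,+1}-coded weights
q_x = np.array([ 1,  3, 3, 1, -3, -1])           # and activations
w_planes = decompose(q_w, m)
x_planes = decompose(q_x, m)

# Dot product as a sum of scaled binary (XNOR-friendly) dot products
binary_sum = sum((1 << (i + j)) * int(bw @ bx)
                 for i, bw in enumerate(w_planes)
                 for j, bx in enumerate(x_planes))
assert binary_sum == int(q_w @ q_x)
```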
- On the quantization of recurrent neural networks [9.549757800469196]
Quantization of neural networks can be defined as approximating the high-precision computation of the canonical neural network formulation using lower-precision arithmetic.
We present an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies.
arXiv Detail & Related papers (2021-01-14T04:25:08Z)
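
Integer-only LSTM inference needs integer-only gate nonlinearities. One common approach, which may or may not match this paper's strategy, is to precompute sigmoid and tanh as small fixed-point lookup tables over the quantized pre-activation range. A sketch with an assumed signed 8-bit input at scale 8/127:

```python
import numpy as np

def build_sigmoid_lut(in_bits=8, out_bits=8, in_scale=8 / 127):
    """Precompute an integer lookup table approximating sigmoid for
    signed in_bits inputs; outputs are unsigned out_bits fixed-point."""
    q_in = np.arange(-(1 << (in_bits - 1)), 1 << (in_bits - 1))
    y = 1.0 / (1.0 + np.exp(-q_in * in_scale))   # dequantize, apply sigmoid
    return np.round(y * ((1 << out_bits) - 1)).astype(np.int32)

LUT = build_sigmoid_lut()

def int_sigmoid(q: np.ndarray, in_bits: int = 8) -> np.ndarray:
    """Integer-only sigmoid: shift signed input into table index range."""
    return LUT[q + (1 << (in_bits - 1))]

q = np.array([-128, -64, 0, 64, 127])
print(int_sigmoid(q))   # monotone values from ~0 to 255 in Q0.8
```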
- Binarization Methods for Motor-Imagery Brain-Computer Interface Classification [18.722731794073756]
We propose methods for transforming real-valued weights to binary numbers for efficient inference.
By tuning the dimension of the binary embedding, we achieve almost the same accuracy in 4-class MI (at most 1.27% lower) compared to models with float16 weights.
Our method replaces the fully connected layer of CNNs with a binary augmented memory using bipolar random projection.
arXiv Detail & Related papers (2020-10-14T12:28:18Z)
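
A bipolar random projection is simple to sketch: a fixed random {-1, +1} matrix projects the feature vector and a sign function binarizes the result, with the embedding dimension acting as the accuracy/size knob mentioned above. All dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_embed(x: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Bipolar random projection: map a real feature vector to a
    +-1 embedding whose dimension is tunable."""
    return np.sign(R @ x).astype(np.int8)

d_embed, d_feat = 1024, 256          # embedding dimension is the knob
R = rng.choice([-1, 1], size=(d_embed, d_feat)).astype(np.int8)

x = rng.standard_normal(d_feat)
code = binary_embed(x, R)            # replaces the FC layer's float output
# Classification can then compare 'code' to stored class prototypes by
# Hamming distance instead of floating-point dot products.
```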
- Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z)
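
The 2MB constraint is easy to make concrete. The sketch below replaces the paper's Reinforcement Learning agent with a simple greedy heuristic over the same {2, 4, 8}-bit choices, purely to show how a per-layer bit-width assignment is checked against an MCU weight budget; the parameter counts are hypothetical:

```python
def model_weight_bytes(layer_params, bits_per_layer):
    """Weight memory footprint (bytes) for a per-layer bit-width assignment."""
    return sum(n * b for n, b in zip(layer_params, bits_per_layer)) // 8

def greedy_fit(layer_params, budget_bytes=2 * 1024 * 1024, choices=(8, 4, 2)):
    """Start at the highest precision and lower the largest layers first
    until the weight budget is met (a greedy stand-in for the RL agent)."""
    bits = [choices[0]] * len(layer_params)
    order = sorted(range(len(layer_params)),
                   key=lambda i: layer_params[i], reverse=True)
    for target in choices[1:]:           # progressively lower precision
        for i in order:
            if model_weight_bytes(layer_params, bits) <= budget_bytes:
                return bits
            bits[i] = target
    return bits

# Hypothetical parameter counts for a 6-layer CNN
layers = [3_000_000, 2_000_000, 1_500_000, 800_000, 400_000, 100_000]
bits = greedy_fit(layers)
print(bits, model_weight_bytes(layers, bits) // 1024, "KiB")
```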