MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network
Quantization Framework
- URL: http://arxiv.org/abs/2009.07460v2
- Date: Sat, 17 Oct 2020 01:58:38 GMT
- Title: MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network
Quantization Framework
- Authors: Sung-En Chang, Yanyu Li, Mengshu Sun, Weiwen Jiang, Runbin Shi, Xue
Lin, Yanzhi Wang
- Abstract summary: This paper targets the commonly used FPGA devices as the hardware platforms for deep learning edge computing.
We propose a mixed-scheme DNN quantization method that incorporates both the linear and non-linear number systems for quantization.
We use a quantization method that supports multiple precisions along the intra-layer dimension, while the existing quantization methods apply multi-precision quantization along the inter-layer dimension.
- Score: 39.43144643349916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the tremendous success of deep learning, there is an imminent need to
deploy deep learning models onto edge devices. To tackle the limited computing
and storage resources in edge devices, model compression techniques have been
widely used to trim deep neural network (DNN) models for on-device inference
execution. This paper targets the commonly used FPGA (field programmable gate
array) devices as the hardware platforms for DNN edge computing. We focus on
DNN quantization as the main model compression technique, since DNN quantization
has been of great importance for implementing DNN models on hardware platforms.
The novelty of this work is twofold: (i) We
propose a mixed-scheme DNN quantization method that incorporates both the
linear and non-linear number systems for quantization, with the aim to boost
the utilization of the heterogeneous computing resources, i.e., LUTs (look up
tables) and DSPs (digital signal processors) on an FPGA. Note that all the
existing (single-scheme) quantization methods can utilize only one type of
resource (either LUTs or DSPs) for the MAC (multiply-accumulate) operations in
deep learning computations. (ii) We use a quantization method that supports
multiple precisions along the intra-layer dimension, while the existing
quantization methods apply multi-precision quantization along the inter-layer
dimension. The intra-layer multi-precision method can make the hardware
configurations uniform across different layers to reduce computation overhead,
while preserving model accuracy as well as the inter-layer approach does.
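To make the two contributions concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how one layer's weight matrix could be quantized with the mixed scheme: a fixed fraction of the rows is quantized with the linear fixed-point scheme (mapped to DSP slices), and the remaining rows with the non-linear power-of-two scheme (mapped to LUTs, since multiplying by a power of two reduces to a bit shift). The bit width, the DSP/LUT split ratio, and the per-row standard-deviation heuristic used for the assignment are assumptions chosen here for illustration only.

```python
import numpy as np

def quantize_fixed_point(w, bits=4):
    """Linear (fixed-point) quantization: a uniform, symmetric grid of
    2^bits - 1 levels; the resulting MACs map naturally onto DSP slices."""
    scale = np.max(np.abs(w)) + 1e-12
    levels = 2 ** (bits - 1) - 1
    return np.round(w / scale * levels) / levels * scale

def quantize_power_of_two(w, bits=4):
    """Non-linear (power-of-two) quantization: every weight becomes sign * 2^e,
    so multiplication degenerates to a bit shift and can be served by LUTs."""
    scale = np.max(np.abs(w)) + 1e-12
    sign = np.sign(w)
    mag = np.abs(w) / scale                      # normalized magnitudes in [0, 1]
    min_exp = -(2 ** (bits - 1) - 1)             # smallest representable exponent
    exp = np.clip(np.round(np.log2(np.maximum(mag, 2.0 ** min_exp))), min_exp, 0)
    q = sign * (2.0 ** exp) * scale
    q[mag < 2.0 ** (min_exp - 1)] = 0.0          # very small weights flush to zero
    return q

def mixed_scheme_quantize(weight, dsp_ratio=0.5, bits=4):
    """Assign schemes along the intra-layer dimension: a fixed fraction
    `dsp_ratio` of the rows (output channels) gets the fixed-point scheme,
    the rest the power-of-two scheme. The per-row standard deviation is used
    here only as an illustrative assignment heuristic."""
    q = np.empty_like(weight, dtype=np.float64)
    n_dsp = int(dsp_ratio * weight.shape[0])
    order = np.argsort(-np.std(weight, axis=1))  # rows sorted by decreasing spread
    dsp_rows, lut_rows = order[:n_dsp], order[n_dsp:]
    q[dsp_rows] = quantize_fixed_point(weight[dsp_rows], bits)
    q[lut_rows] = quantize_power_of_two(weight[lut_rows], bits)
    return q

# Example: quantize one layer's weight matrix (output channels x input features).
w = np.random.randn(64, 128)
w_q = mixed_scheme_quantize(w, dsp_ratio=0.5, bits=4)
```

Because every layer uses the same split ratio, the per-layer hardware configuration stays uniform, which is the point of the intra-layer (rather than inter-layer) multi-precision approach.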
Related papers
- Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference [4.093167352780157]
We introduce Logarithmic Posits (LP), an adaptive, hardware-friendly data type inspired by posits.
We also develop a novel genetic-algorithm based framework, LP Quantization (LPQ), to find optimal layer-wise LP parameters.
arXiv Detail & Related papers (2024-03-08T17:28:49Z) - End-to-end codesign of Hessian-aware quantized neural networks for FPGAs
and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs).
This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow.
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC).
We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
arXiv Detail & Related papers (2023-04-13T18:00:01Z) - Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal one particular protocol, involving sequential growth and optimization of the quantum circuit, to outperform all other methods.
arXiv Detail & Related papers (2022-09-01T17:08:41Z) - Edge Inference with Fully Differentiable Quantized Mixed Precision
Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z) - A Comprehensive Survey on Model Quantization for Deep Neural Networks in
Image Classification [0.0]
A promising approach is quantization, in which the full-precision values are stored in low bit-width precision.
We present a comprehensive survey of quantization concepts and methods, with a focus on image classification.
We explain the replacement of floating-point operations with low-cost bitwise operations in a quantized DNN and the sensitivity of different layers to quantization.
arXiv Detail & Related papers (2022-05-14T15:08:32Z) - Mixed Precision Low-bit Quantization of Neural Network Language Models
for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z) - Low-bit Quantization of Recurrent Neural Network Language Models Using
Alternating Direction Methods of Multipliers [67.688697838109]
This paper presents a novel method to train quantized RNNLMs from scratch using alternating direction methods of multipliers (ADMM).
Experiments on two tasks suggest the proposed ADMM quantization achieved a model size compression factor of up to 31 times over the full precision baseline RNNLMs.
arXiv Detail & Related papers (2021-11-29T09:30:06Z) - ILMPQ : An Intra-Layer Multi-Precision Deep Neural Network Quantization
framework for FPGA [37.780528948703406]
This work targets the commonly used FPGA (field-programmable gate array) devices as the hardware platform for DNN edge computing.
We use a quantization method that supports multiple precisions along the intra-layer dimension.
We achieve a 3.65x speedup in end-to-end inference time on ImageNet, compared with the fixed-point quantization method.
arXiv Detail & Related papers (2021-10-30T03:02:52Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization
Framework [39.981546951333556]
This paper focuses on weight quantization, a hardware-friendly model compression approach.
It is motivated by (1) the observation that the weight distributions differ across rows; and (2) the potential of achieving better utilization of FPGA hardware resources.
arXiv Detail & Related papers (2020-12-08T06:25:07Z)