Trainable Fixed-Point Quantization for Deep Learning Acceleration on
FPGAs
- URL: http://arxiv.org/abs/2401.17544v1
- Date: Wed, 31 Jan 2024 02:18:27 GMT
- Title: Trainable Fixed-Point Quantization for Deep Learning Acceleration on
FPGAs
- Authors: Dingyi Dai, Yichi Zhang, Jiahao Zhang, Zhanqiu Hu, Yaohui Cai, Qi Sun,
Zhiru Zhang
- Abstract summary: Quantization is a crucial technique for deploying deep learning models on resource-constrained devices, such as embedded FPGAs.
We present QFX, a trainable fixed-point quantization approach that automatically learns the binary-point position during model training.
QFX is implemented as a PyTorch-based library that efficiently emulates fixed-point arithmetic, supported by FPGA HLS.
- Score: 30.325651150798915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is a crucial technique for deploying deep learning models on
resource-constrained devices, such as embedded FPGAs. Prior efforts mostly
focus on quantizing matrix multiplications, leaving other layers like BatchNorm
or shortcuts in floating-point form, even though fixed-point arithmetic is more
efficient on FPGAs. A common practice is to fine-tune a pre-trained model to
fixed-point for FPGA deployment, but this can degrade accuracy.
This work presents QFX, a novel trainable fixed-point quantization approach
that automatically learns the binary-point position during model training.
Additionally, we introduce a multiplier-free quantization strategy within QFX
to minimize DSP usage. QFX is implemented as a PyTorch-based library that
efficiently emulates fixed-point arithmetic, supported by FPGA HLS, in a
differentiable manner during backpropagation. With minimal effort, models
trained with QFX can readily be deployed through HLS, producing the same
numerical results as their software counterparts. Our evaluation shows that,
compared to post-training quantization, QFX can quantize element-wise layers
to fewer bits while achieving higher accuracy on both the CIFAR-10 and
ImageNet datasets. We further demonstrate the efficacy of
multiplier-free quantization using a state-of-the-art binarized neural network
accelerator designed for an embedded FPGA (AMD Xilinx Ultra96 v2). We plan to
release QFX as open source.
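Since QFX has not yet been released, the snippet below is only a minimal sketch of the core idea under stated assumptions: a fixed-point quantizer whose binary-point position is a trainable parameter, kept differentiable with a straight-through estimator (STE). The class name `TrainableFixedPoint` and its parameters are hypothetical, not the QFX API.

```python
import torch
import torch.nn as nn

class TrainableFixedPoint(nn.Module):
    """Sketch of a fixed-point quantizer with a learnable binary-point
    position (illustrative; not the actual QFX implementation)."""

    def __init__(self, total_bits=8, init_frac_bits=4.0):
        super().__init__()
        self.total_bits = total_bits
        # Continuous relaxation of the integer number of fractional bits.
        self.frac_bits = nn.Parameter(torch.tensor(init_frac_bits))

    def forward(self, x):
        # Round the binary-point position with an STE so gradients
        # still reach self.frac_bits during backpropagation.
        f = self.frac_bits + (torch.round(self.frac_bits) - self.frac_bits).detach()
        scale = 2.0 ** f
        qmax = 2.0 ** (self.total_bits - 1) - 1   # signed fixed-point range
        qmin = -(2.0 ** (self.total_bits - 1))
        xs = x * scale
        xr = xs + (torch.round(xs) - xs).detach()  # STE rounding
        return torch.clamp(xr, qmin, qmax) / scale

# Usage: wrap activations or element-wise layers during training.
quant = TrainableFixedPoint(total_bits=8)
y = quant(torch.randn(4, 16))
```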
Related papers
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevalent backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
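As a rough sketch of the general idea (not the actual AdaLog implementation, which includes calibration and hardware-friendly details described in the paper), a logarithmic quantizer with an adjustable base could look like this; the function name and defaults are assumptions:

```python
import torch

def log_quantize(x, n_bits=4, base=2.0, scale=1.0, eps=1e-8):
    """Non-uniform logarithmic quantization for non-negative activations
    (e.g., post-softmax); `base` is the adaptive logarithm base."""
    levels = 2 ** n_bits - 1
    # Map to the log domain with the chosen base.
    exp = torch.log(x.clamp(min=eps) / scale) / torch.log(torch.tensor(base))
    # Round the exponent to the nearest representable level.
    exp_q = torch.clamp(torch.round(exp), min=-levels, max=0)
    return scale * torch.tensor(base) ** exp_q
```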
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge [40.85258685379659]
Post-Training Quantization (PTQ) methods degrade in quality when quantizing weights, activations, and KV cache together to below 8 bits.
Many Quantization-Aware Training (QAT) works quantize model weights while leaving the activations untouched, which does not fully exploit the potential of quantization for inference acceleration on the edge.
We propose EdgeQAT, an Entropy and Distribution Guided QAT method for optimizing lightweight LLMs to achieve inference acceleration on edge devices.
arXiv Detail & Related papers (2024-02-16T16:10:38Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization-Aware pipeline that allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance [49.1574468325115]
Accumulator-aware quantization (A2Q) is a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow during inference.
A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds.
We show A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline.
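A simplified sketch of this weight-normalization idea follows; the exact bound and quantizer in A2Q differ, and the `limit` formula here is an assumed simplification for unsigned `input_bits`-bit activations:

```python
import torch
import torch.nn as nn

class L1ConstrainedWeight(nn.Module):
    """Sketch: reparameterize each output channel's weights as
    g * v / ||v||_1, clamping g to an accumulator-derived bound."""

    def __init__(self, out_ch, in_ch, acc_bits=16, input_bits=8):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_ch, in_ch))
        self.g = nn.Parameter(torch.ones(out_ch))
        # Assumed overflow bound: largest L1 norm an acc_bits-bit signed
        # accumulator can absorb given unsigned input_bits-bit inputs.
        self.limit = (2 ** (acc_bits - 1) - 1) / (2 ** input_bits - 1)

    def weight(self):
        g = self.g.clamp(max=self.limit)                      # enforce bound
        l1 = self.v.abs().sum(dim=1, keepdim=True).clamp(min=1e-8)
        return g.unsqueeze(1) * self.v / l1
```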
arXiv Detail & Related papers (2023-08-25T17:28:58Z)
- ILMPQ: An Intra-Layer Multi-Precision Deep Neural Network Quantization Framework for FPGA [37.780528948703406]
This work targets commonly used FPGA (field-programmable gate array) devices as the hardware platform for DNN edge computing.
We use a quantization method that supports multiple precisions along the intra-layer dimension.
We achieve a 3.65x speedup in end-to-end inference time on ImageNet, compared with the fixed-point quantization method.
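The intra-layer idea can be pictured with a sketch that quantizes different filter groups of a single layer at different bitwidths; the helper names and the bit assignment below are hypothetical:

```python
import torch

def uniform_quantize(w, bits):
    """Symmetric uniform quantization helper."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def intra_layer_multi_precision(weight, bit_assignment):
    """Quantize filter-index ranges of one layer at different bitwidths."""
    out = weight.clone()
    for (start, end), bits in bit_assignment.items():
        out[start:end] = uniform_quantize(weight[start:end], bits)
    return out

# e.g., first half of the filters at 4 bits, second half at 8 bits
w = torch.randn(64, 32, 3, 3)
wq = intra_layer_multi_precision(w, {(0, 32): 4, (32, 64): 8})
```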
arXiv Detail & Related papers (2021-10-30T03:02:52Z)
- FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding [11.820523621760255]
Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training.
We propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP.
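A minimal sketch of BFP conversion with stochastic rounding, assuming the tensor size divides evenly into blocks (illustrative, not the FAST system itself):

```python
import torch

def to_bfp_stochastic(x, mantissa_bits=4, block_size=16):
    """Each block shares the exponent of its largest magnitude;
    mantissas are stochastically rounded."""
    flat = x.reshape(-1, block_size)  # assumes numel divisible by block_size
    max_abs = flat.abs().max(dim=1, keepdim=True).values.clamp(min=1e-30)
    exp = torch.floor(torch.log2(max_abs))      # shared per-block exponent
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    m = flat / scale
    # Stochastic rounding: round up with probability equal to the fraction.
    floor_m = torch.floor(m)
    m_q = floor_m + (torch.rand_like(m) < (m - floor_m)).float()
    return (m_q * scale).reshape(x.shape)
```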
arXiv Detail & Related papers (2021-10-28T22:24:33Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs, and propose module-wise quantization error minimization (MREM), an efficient solution to mitigate the accuracy degradation of PTQ.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
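A hedged sketch of the module-wise loop: tune each quantized module so its outputs reconstruct those of the full-precision module on a small calibration set. Names and hyperparameters are illustrative, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def module_wise_error_minimization(fp_module, q_module, calib_inputs,
                                   steps=100, lr=1e-4):
    """Minimize the per-module reconstruction error on calibration data."""
    opt = torch.optim.Adam(q_module.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_inputs:
            with torch.no_grad():
                target = fp_module(x)       # full-precision reference output
            loss = F.mse_loss(q_module(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_module
```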
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
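One way to picture bit-drop (an assumed sketch, not the exact DropBits procedure): each forward pass randomly lowers the quantization bitwidth, analogous to dropout acting on bits rather than neurons:

```python
import torch

def dropbits_quantize(x, max_bits=8, drop_p=0.5):
    """With probability drop_p, quantize with one fewer bit this pass."""
    bits = max_bits - int(torch.rand(()) < drop_p)
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    xs = x / scale
    xq = xs + (torch.round(xs) - xs).detach()  # straight-through rounding
    return xq.clamp(-qmax, qmax) * scale
```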
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network Quantization Framework [39.43144643349916]
This paper targets the commonly used FPGA devices as the hardware platforms for deep learning edge computing.
We propose a mixed-scheme DNN quantization method that incorporates both the linear and non-linear number systems for quantization.
We use a quantization method that supports multiple precisions along the intra-layer dimension, whereas existing quantization methods apply multi-precision quantization along the inter-layer dimension.
arXiv Detail & Related papers (2020-09-16T04:24:18Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to eliminate floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- A Learning Framework for n-bit Quantized Neural Networks toward FPGAs [20.83904734716565]
This paper proposes a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two.
We also propose a novel QNN structure named n-BQ-NN, which uses shift operation to replace the multiply operation.
Experiments show that our n-BQ-NN with our SVPE can execute 2.9 times faster than with the vector processing element (VPE) in inference.
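Power-of-two weights are what make the shift-for-multiply substitution possible; a minimal sketch of such a quantizer, assuming |w| <= 1, is:

```python
import torch

def power_of_two_quantize(w, n_bits=4):
    """Map each weight to sign * 2^e so multiplies become bit shifts."""
    e = torch.round(torch.log2(w.abs().clamp(min=1e-8)))
    e = e.clamp(min=-(2 ** (n_bits - 1)), max=0)  # assumes |w| <= 1
    return torch.sign(w) * 2.0 ** e

# At inference, y += x * (sign * 2^e) becomes a right shift by -e (e <= 0).
```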
arXiv Detail & Related papers (2020-04-06T04:21:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.