Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization
Framework
- URL: http://arxiv.org/abs/2012.04240v2
- Date: Sat, 12 Dec 2020 00:54:47 GMT
- Title: Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization
Framework
- Authors: Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K.-H. So,
Xuehai Qian, Yanzhi Wang, Xue Lin
- Abstract summary: This paper focuses on weight quantization, a hardware-friendly model compression approach.
It is motivated by (1) the observation that the weight distributions of different rows are not the same; and (2) the potential of achieving better utilization of FPGA hardware resources.
- Score: 39.981546951333556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Neural Networks (DNNs) have achieved extraordinary performance in
various application domains. To support diverse DNN models, efficient
implementations of DNN inference on edge-computing platforms, e.g., ASICs,
FPGAs, and embedded systems, are extensively investigated. Due to the huge
model size and computation amount, model compression is a critical step to
deploy DNN models on edge devices. This paper focuses on weight quantization, a
hardware-friendly model compression approach that is complementary to weight
pruning. Unlike existing methods that use the same quantization scheme for all
weights, we propose the first solution that applies different quantization
schemes for different rows of the weight matrix. It is motivated by (1) the
observation that the weight distributions of different rows are not the same; and (2) the
potential of achieving better utilization of heterogeneous FPGA hardware
resources. To achieve that, we first propose a hardware-friendly quantization
scheme named sum-of-power-of-2 (SP2) suitable for Gaussian-like weight
distribution, in which the multiplication arithmetic can be replaced with logic
shifters and adders, thereby enabling highly efficient implementations with the
FPGA LUT resources. In contrast, the existing fixed-point quantization is
suitable for Uniform-like weight distribution and can be implemented
efficiently by DSP. Then to fully explore the resources, we propose an
FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed
SP2 and the fixed-point schemes. Combining the two schemes can maintain, or
even increase, accuracy due to better matching with the weight distributions.
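A minimal NumPy sketch of the two building blocks described above: an SP2 codebook whose levels are sums of two powers of two (so multiplying by a weight reduces to two shifts and an add, mapping to LUTs), uniform fixed-point levels (mapping to DSPs), and a row-wise chooser in the spirit of MSQ. The exponent budget, the kurtosis-based row test, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sp2_levels(num_bits=4):
    """Sum-of-power-of-2 (SP2) levels: +/-(2**-a + 2**-b).
    Multiplying by such a level needs only two shifts and one add.
    The exponent range is an assumed budget, not the paper's codebook."""
    exps = range(1, 2 ** (num_bits - 1))
    vals = {0.0}
    for a in exps:
        for b in exps:
            vals.add(2.0 ** -a + 2.0 ** -b)
            vals.add(-(2.0 ** -a + 2.0 ** -b))
    return np.array(sorted(vals))

def fixed_point_levels(num_bits=4):
    """Uniform fixed-point levels in [-1, 1), suited to DSP multipliers."""
    step = 2.0 ** -(num_bits - 1)
    return np.arange(-2 ** (num_bits - 1), 2 ** (num_bits - 1)) * step

def quantize_to(levels, w):
    """Round each entry of w to the nearest level."""
    return levels[np.abs(w[..., None] - levels).argmin(axis=-1)]

def msq_rows(W, num_bits=4, kurtosis_split=2.4):
    """Row-wise mixed-scheme quantization (MSQ) sketch: Gaussian-like rows
    (peaked around zero, kurtosis near 3) get SP2, flatter rows (kurtosis
    near 1.8) get fixed-point. The kurtosis rule is a stand-in criterion."""
    sp2, fxp = sp2_levels(num_bits), fixed_point_levels(num_bits)
    out = np.empty_like(W)
    for i, row in enumerate(W):
        scale = np.abs(row).max() + 1e-12
        r = row / scale
        kurt = np.mean(r ** 4) / (np.mean(r ** 2) ** 2 + 1e-12)
        levels = sp2 if kurt > kurtosis_split else fxp
        out[i] = quantize_to(levels, r) * scale
    return out

W = np.random.randn(8, 64) * 0.1
print("mean abs quantization error:", np.abs(W - msq_rows(W)).mean())
```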
Related papers
- Stochastic Configuration Machines: FPGA Implementation [4.57421617811378]
Stochastic configuration networks (SCNs) are a prime choice in industrial applications due to their merits and feasibility for data modelling.
This paper aims to implement SCM models on a field programmable gate array (FPGA) and introduce binary-coded inputs to improve learning performance.
arXiv Detail & Related papers (2023-10-30T02:04:20Z)
- A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance [49.1574468325115]
accumulator-aware quantization (A2Q) is a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow during inference.
A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds.
We show A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline.
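The overflow guarantee described above can be illustrated by bounding the worst-case dot product against the accumulator's representable range. A hedged sketch, assuming unsigned N-bit activations and a signed P-bit accumulator; the exact constants and the training-time re-parameterization are in the A2Q paper, and the bound form below is an assumption:

```python
import numpy as np

def a2q_l1_limit(acc_bits: int, act_bits: int) -> float:
    """Largest L1 norm of an integer weight row that cannot overflow a signed
    `acc_bits` accumulator when inputs are unsigned `act_bits` integers
    (assumed form of the bound; see the A2Q paper for the exact statement)."""
    return (2 ** (acc_bits - 1) - 1) / (2 ** act_bits - 1)

def clip_row_to_limit(w_int, acc_bits=16, act_bits=8):
    """Post-hoc illustration only: rescale a row so its L1 norm respects the
    limit. A2Q instead enforces the constraint during training via a
    weight-normalization-style re-parameterization."""
    limit = a2q_l1_limit(acc_bits, act_bits)
    l1 = np.abs(w_int).sum()
    if l1 <= limit:
        return w_int
    return np.trunc(w_int * (limit / l1))   # truncation never increases the norm

row = np.random.randint(-8, 8, size=64)
print(np.abs(clip_row_to_limit(row)).sum() <= a2q_l1_limit(16, 8))   # True
```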
arXiv Detail & Related papers (2023-08-25T17:28:58Z)
- EQ-Net: Elastic Quantization Neural Networks [15.289359357583079]
Elastic Quantization Neural Networks (EQ-Net) aims to train a robust weight-sharing quantization supernet.
We propose an elastic quantization space (including elastic bit-width, granularity, and symmetry) to adapt to various mainstream quantitative forms.
We incorporate genetic algorithms and the proposed Conditional Quantization-Aware Accuracy Predictor (CQAP) as an estimator to quickly search for mixed-precision quantized neural networks within the supernet.
arXiv Detail & Related papers (2023-08-15T08:57:03Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
For single-batch inference, the main bottleneck of generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
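The Dense-and-Sparse decomposition above can be illustrated with a generic split: keep a small fraction of large-magnitude outlier weights at full precision in a sparse matrix and quantize only the remaining dense part. The outlier fraction, the uniform codebook, and the function name are stand-ins for illustration, not SqueezeLLM's actual sensitivity-based method or kernels:

```python
import numpy as np
from scipy.sparse import csr_matrix

def dense_sparse_split(W, outlier_frac=0.005, bits=3):
    """Split W into a coarsely quantized dense part plus a full-precision
    sparse part holding the largest-magnitude outliers (illustrative only)."""
    k = max(1, int(outlier_frac * W.size))
    cutoff = np.partition(np.abs(W).ravel(), -k)[-k]   # k-th largest magnitude
    outliers = np.abs(W) >= cutoff
    sparse = csr_matrix(np.where(outliers, W, 0.0))    # outliers kept exactly
    dense = np.where(outliers, 0.0, W)
    scale = np.abs(dense).max() / (2 ** (bits - 1) - 1)
    dense_q = np.round(dense / scale) * scale          # naive uniform grid at `bits` bits
    return dense_q, sparse

W = np.random.randn(128, 128) * 0.02
W[0, 0] = 5.0                                          # inject one outlier
dense_q, sparse = dense_sparse_split(W)
print(sparse.nnz, np.abs(W - (dense_q + sparse.toarray())).max())
```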
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation [48.838691414561694]
Nonuniform-to-Uniform Quantization (N2UQ) is a method that can maintain the strong representation ability of nonuniform methods while being hardware-friendly and efficient.
N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.7% to 1.8% on ImageNet.
arXiv Detail & Related papers (2021-11-29T18:59:55Z)
- ILMPQ: An Intra-Layer Multi-Precision Deep Neural Network Quantization Framework for FPGA [37.780528948703406]
This work targets the commonly used FPGA (field-programmable gate array) devices as the hardware platform for DNN edge computing.
We use a quantization method that supports multiple precisions along the intra-layer dimension.
We achieve a 3.65x speedup in end-to-end inference time on ImageNet, compared with the fixed-point quantization method.
arXiv Detail & Related papers (2021-10-30T03:02:52Z)
- MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network Quantization Framework [39.43144643349916]
This paper targets the commonly used FPGA devices as the hardware platforms for deep learning edge computing.
We propose a mixed-scheme DNN quantization method that incorporates both the linear and non-linear number systems for quantization.
We use a quantization method that supports multiple precisions along the intra-layer dimension, while the existing quantization methods apply multi-precision quantization along the inter-layer dimension.
arXiv Detail & Related papers (2020-09-16T04:24:18Z)
- AUSN: Approximately Uniform Quantization by Adaptively Superimposing Non-uniform Distribution for Deep Neural Networks [0.7378164273177589]
Existing uniform and non-uniform quantization methods exhibit an inherent conflict between representable range and representation resolution.
We propose a novel quantization method to quantize the weight and activation.
The key idea is to Approximate the Uniform quantization by Adaptively Superposing multiple Non-uniform quantized values, namely AUSN.
arXiv Detail & Related papers (2020-07-08T05:10:53Z)