PIM-QAT: Neural Network Quantization for Processing-In-Memory (PIM) Systems
- URL: http://arxiv.org/abs/2209.08617v1
- Date: Sun, 18 Sep 2022 17:51:55 GMT
- Title: PIM-QAT: Neural Network Quantization for Processing-In-Memory (PIM) Systems
- Authors: Qing Jin, Zhiyu Chen, Jian Ren, Yanyu Li, Yanzhi Wang, Kaiyuan Yang
- Abstract summary: We propose a PIM quantization aware training (PIM-QAT) algorithm, and introduce rescaling techniques to facilitate training convergence.
We also propose two techniques, namely batch normalization (BN) calibration and adjusted precision training, to suppress the adverse effects of non-ideal linearity and thermal noise involved in real PIM chips.
- Score: 36.35995812401125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Processing-in-memory (PIM), an increasingly studied form of neuromorphic
hardware, promises orders-of-magnitude energy and throughput improvements for deep
learning inference. By leveraging the massively parallel and efficient analog computing
inside memory arrays, PIM circumvents the data-movement bottlenecks of conventional
digital hardware. However, an extra quantization step (i.e., PIM quantization),
typically with limited resolution due to hardware constraints, is required to convert
the analog computing results into the digital domain. Meanwhile, non-ideal effects are
pervasive in PIM quantization because of the imperfect analog-to-digital interface,
which further compromises inference accuracy.
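To make this extra quantization step concrete, here is a minimal NumPy sketch of a bit-serial PIM matrix-vector product in which each analog partial sum passes through a limited-resolution ADC before digital shift-and-add. The bit-serial input decomposition, the symmetric uniform quantizer, and the 4-bit settings are illustrative assumptions, not the configuration of the paper's chip or decomposition schemes.

```python
import numpy as np

def adc_quantize(x, bits=4, x_max=None):
    """Uniform quantizer standing in for a limited-resolution ADC.
    Assumption: symmetric uniform levels over [-x_max, x_max]; real ADC
    transfer curves may differ (non-ideal linearity, noise)."""
    if x_max is None:
        x_max = np.max(np.abs(x)) + 1e-8
    step = 2 * x_max / (2 ** bits - 1)
    q = np.clip(np.round(x / step), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * step

def pim_matvec(W_int, x_int, x_bits=4, adc_bits=4):
    """Bit-serial PIM matrix-vector product: the input is decomposed into
    binary bit-planes, each plane's analog partial sum is digitized by the
    ADC, and the digitized results are combined by digital shift-and-add."""
    y = np.zeros(W_int.shape[0])
    for b in range(x_bits):
        x_slice = (x_int >> b) & 1                       # one input bit-plane
        partial = W_int @ x_slice                        # "analog" in-memory accumulation
        partial = adc_quantize(partial, bits=adc_bits)   # the extra PIM quantization step
        y += partial * (2 ** b)                          # digital shift-and-add
    return y

# Toy usage: 4-bit weights and inputs, 4-bit ADC.
rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(16, 64))
x = rng.integers(0, 16, size=64)
print(np.allclose(pim_matvec(W, x), W @ x))  # generally False: the ADC rounding loses information
```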
In this paper, we propose a method for training quantized networks to
incorporate PIM quantization, which is ubiquitous in all PIM systems.
Specifically, we propose a PIM quantization aware training (PIM-QAT) algorithm
and introduce rescaling techniques during forward and backward propagation,
derived from an analysis of the training dynamics, to facilitate training
convergence. We also propose two techniques, namely batch normalization (BN)
calibration and adjusted precision training, to suppress the adverse effects of
non-ideal linearity and stochastic thermal noise present in real PIM chips. Our
method is validated on three mainstream PIM decomposition schemes, and
physically on a prototype chip. Compared with directly deploying a
conventionally trained quantized model on PIM systems, which does not take this
extra quantization step into account and thus fails, our method provides
significant improvements. It also achieves inference accuracy on PIM systems
comparable to that of conventionally quantized models on digital hardware,
across the CIFAR10 and CIFAR100 datasets using various network depths of the
most popular network topology.
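For a feel of how PIM quantization can be folded into quantization-aware training, the following PyTorch sketch fake-quantizes a layer's output with a straight-through estimator and then re-estimates BatchNorm statistics under the quantized forward pass. The hypothetical PIMLinear layer, the per-batch rescaling of the ADC range, the whole-output (rather than per-bit-slice) quantization, and the calibration loop are simplified assumptions for illustration; they are not the paper's exact PIM-QAT algorithm, rescaling derivation, or BN calibration procedure.

```python
import torch
import torch.nn as nn

class STEQuant(torch.autograd.Function):
    """Fake quantization with a straight-through estimator: quantize in the
    forward pass, pass gradients through unchanged in the backward pass."""
    @staticmethod
    def forward(ctx, x, bits, x_max):
        step = 2 * x_max / (2 ** bits - 1)
        q = torch.clamp(torch.round(x / step), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return q * step

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None, None

class PIMLinear(nn.Module):
    """Hypothetical linear layer whose output passes through an extra 'ADC'
    quantizer, mimicking the PIM quantization of analog partial sums.
    Assumption: the whole output is quantized at once and the ADC range is
    rescaled per batch; a real scheme would be per bit-slice and calibrated."""
    def __init__(self, in_f, out_f, adc_bits=4):
        super().__init__()
        self.fc = nn.Linear(in_f, out_f)
        self.adc_bits = adc_bits

    def forward(self, x):
        y = self.fc(x)
        x_max = y.detach().abs().max() + 1e-8   # illustrative rescaling of the ADC range
        return STEQuant.apply(y, self.adc_bits, x_max)

@torch.no_grad()
def bn_calibration(model, loader, device="cpu"):
    """Re-estimate BatchNorm running statistics under the PIM-quantized
    forward pass (a simplified stand-in for the paper's BN calibration)."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.reset_running_stats()
    model.train()
    for x, _ in loader:
        model(x.to(device))

# Usage sketch (hypothetical sizes):
# model = nn.Sequential(PIMLinear(784, 256), nn.BatchNorm1d(256), nn.ReLU(), PIMLinear(256, 10))
```

A fuller treatment would also model the chip's non-ideal linearity and stochastic thermal noise during training, which this sketch omits.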
Related papers
- EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Quantization Aware Factorization for Deep Neural Network Compression [20.04951101799232]
Tensor decomposition of convolutional and fully-connected layers is an effective way to reduce the number of parameters and FLOPs in neural networks.
A conventional post-training quantization approach applied to networks with decomposed weights yields a drop in accuracy.
This motivated us to develop an algorithm that finds a decomposed approximation directly with quantized factors.
arXiv Detail & Related papers (2023-08-08T21:38:02Z)
- Bulk-Switching Memristor-based Compute-In-Memory Module for Deep Neural Network Training [15.660697326769686]
We propose a mixed-precision training scheme for memristor-based compute-in-memory (CIM) modules.
The proposed scheme is implemented with a system-on-chip (SoC) of fully integrated analog CIM modules and digital sub-systems.
The efficacy of training larger models is evaluated using realistic hardware parameters, showing that analog CIM modules can enable efficient mixed-precision training with accuracy comparable to that of full-precision software-trained models.
arXiv Detail & Related papers (2023-05-23T22:03:08Z)
- Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal one particular protocol, involving sequential growth and optimization of the quantum circuit, to outperform all other methods.
arXiv Detail & Related papers (2022-09-01T17:08:41Z)
- AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing the latency, power consumption, and model size without significant harm to the performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network Quantization Framework [39.43144643349916]
This paper targets the commonly used FPGA devices as the hardware platforms for deep learning edge computing.
We propose a mixed-scheme DNN quantization method that incorporates both the linear and non-linear number systems for quantization.
We use a quantization method that supports multiple precisions along the intra-layer dimension, while the existing quantization methods apply multi-precision quantization along the inter-layer dimension.
arXiv Detail & Related papers (2020-09-16T04:24:18Z)