An Empirical Study of Low Precision Quantization for TinyML
- URL: http://arxiv.org/abs/2203.05492v1
- Date: Thu, 10 Mar 2022 17:22:08 GMT
- Title: An Empirical Study of Low Precision Quantization for TinyML
- Authors: Shaojie Zhuo, Hongyu Chen, Ramchalam Kinattinkara Ramakrishnan, Tommy
Chen, Chen Feng, Yicheng Lin, Parker Zhang, Liang Shen
- Abstract summary: We focus on post-training quantization (PTQ) algorithms that quantize a model to low-bit (less than 8-bit) precision with only a small set of calibration data.
To achieve a fair comparison, we build a simulated quantization framework to investigate recent PTQ algorithms.
Through an ablation study on different alternatives for the components in the pipeline, we reveal key design choices when performing low precision quantization.
- Score: 8.939851623894334
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Tiny machine learning (tinyML) has emerged during the past few years aiming
to deploy machine learning models to embedded AI processors with highly
constrained memory and computation capacity. Low precision quantization is an
important model compression technique that can greatly reduce both memory
consumption and computation cost of model inference. In this study, we focus on
post-training quantization (PTQ) algorithms that quantize a model to low-bit
(less than 8-bit) precision with only a small set of calibration data and
benchmark them on different tinyML use cases. To achieve a fair comparison, we
build a simulated quantization framework to investigate recent PTQ algorithms.
Furthermore, we break down those algorithms into essential components and
re-assemble a generic PTQ pipeline. Through an ablation study on different
alternatives for the components in the pipeline, we reveal key design choices when
performing low precision quantization. We hope this work can provide useful
data points and shed light on future research in low precision quantization.
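To make the setup concrete, below is a minimal sketch of the kind of simulated ("fake") low-bit quantization a PTQ framework like this would apply: a scale is chosen from a small calibration sample via a min/max observer, and weights are quantized to a configurable bit-width and dequantized back to float. The symmetric per-tensor scheme and the helper names (`calibrate_scale`, `fake_quantize`) are illustrative assumptions, not the paper's actual framework.

```python
# Minimal sketch of simulated ("fake") low-bit post-training quantization.
# The symmetric per-tensor scheme and min/max calibration observer are
# illustrative assumptions, not the framework described in the paper.
import numpy as np

def calibrate_scale(calib_data: np.ndarray, num_bits: int) -> float:
    """Pick a quantization scale from a small calibration sample (min/max observer)."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 7 for 4-bit signed
    max_abs = float(np.max(np.abs(calib_data)))
    return max_abs / qmax if max_abs > 0 else 1.0

def fake_quantize(x: np.ndarray, scale: float, num_bits: int) -> np.ndarray:
    """Quantize to num_bits signed integers, then dequantize back to float (simulation)."""
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

# Example: quantize a weight matrix to 4 bits using the tensor itself as calibration data.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)
scale = calibrate_scale(weights, num_bits=4)
w_q = fake_quantize(weights, scale, num_bits=4)
print("mean squared quantization error:", float(np.mean((weights - w_q) ** 2)))
```

Simulated quantization of this sort only measures the accuracy impact of low-bit rounding; it does not by itself deliver the memory or latency savings of a true integer deployment.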
Related papers
- Scaling Laws for Mixed quantization in Large Language Models [10.912306313183972]
Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models.
In this study, we focus on a straightforward question: When aiming for a specific accuracy or perplexity target for low-precision quantization, how many high-precision numbers or calculations must be preserved as we scale LLMs to larger sizes?
arXiv Detail & Related papers (2024-10-09T09:45:01Z) - ISQuant: apply squant to the real deployment [0.0]
We analyze why the combination of quantization and dequantization is used to train the model.
We propose ISQuant as a solution for deploying 8-bit models.
arXiv Detail & Related papers (2024-07-05T15:10:05Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - MixQuant: Mixed Precision Quantization with a Bit-width Optimization
Search [7.564770908909927]
Quantization is a technique for creating efficient Deep Neural Networks (DNNs).
We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error (a simplified per-layer bit-width sketch follows this list).
We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone.
arXiv Detail & Related papers (2023-09-29T15:49:54Z) - A didactic approach to quantum machine learning with a single qubit [68.8204255655161]
We focus on the case of learning with a single qubit, using data re-uploading techniques.
We implement the different proposed formulations in toy and real-world datasets using the Qiskit quantum computing SDK.
arXiv Detail & Related papers (2022-11-23T18:25:32Z) - End-to-end resource analysis for quantum interior point methods and portfolio optimization [63.4863637315163]
We provide a complete quantum circuit-level description of the algorithm from problem input to problem output.
We report the number of logical qubits and the quantity/depth of non-Clifford T-gates needed to run the algorithm.
arXiv Detail & Related papers (2022-11-22T18:54:48Z) - Mixed Precision Low-bit Quantization of Neural Network Language Models
for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z) - MQBench: Towards Reproducible and Deployable Model Quantization
Benchmark [53.12623958951738]
MQBench is a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms.
We choose multiple platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate extensive state-of-the-art quantization algorithms.
We conduct a comprehensive analysis and find considerable intuitive or counter-intuitive insights.
arXiv Detail & Related papers (2021-11-05T23:38:44Z) - VecQ: Minimal Loss DNN Model Compression With Vectorized Weight
Quantization [19.66522714831141]
We develop a new quantization solution called VecQ, which can guarantee minimal direct quantization loss and better model accuracy.
In addition, to speed up the proposed quantization process during training, we accelerate it with a parameterized estimation and a probability-based calculation.
arXiv Detail & Related papers (2020-05-18T07:38:44Z)
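As referenced in the MixQuant entry above, the snippet below is a rough, self-contained sketch of per-layer bit-width assignment driven by round-off error: each layer's weights are scored at several candidate bit-widths, and the smallest bit-width whose quantization error stays under a tolerance is kept. It is a simplified stand-in for a MixQuant-style search, not the published algorithm; the helper `quant_dequant`, the candidate set, and the tolerance are all assumptions for illustration.

```python
# Illustrative per-layer bit-width assignment driven by round-off error.
# A simplified stand-in for a MixQuant-style search, not the published algorithm.
import numpy as np

def quant_dequant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantize-then-dequantize at the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def assign_bit_widths(layer_weights, candidate_bits=(4, 6, 8), tolerance=1e-3):
    """Pick, per layer, the smallest candidate bit-width whose round-off MSE stays under tolerance."""
    assignment = {}
    for name, w in layer_weights.items():
        chosen = max(candidate_bits)                      # widest candidate as fallback
        for bits in sorted(candidate_bits):
            err = float(np.mean((w - quant_dequant(w, bits)) ** 2))
            if err <= tolerance:
                chosen = bits
                break
        assignment[name] = chosen
    return assignment

# Example: assign bit-widths to two randomly initialized layers.
rng = np.random.default_rng(1)
layers = {"conv1": rng.normal(size=(3, 3, 16, 16)), "fc": rng.normal(size=(256, 10))}
print(assign_bit_widths(layers))
```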
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.