CSMPQ: Class Separability Based Mixed-Precision Quantization
- URL: http://arxiv.org/abs/2212.10220v1
- Date: Tue, 20 Dec 2022 12:52:19 GMT
- Title: CSMPQ: Class Separability Based Mixed-Precision Quantization
- Authors: Mingkai Wang, Taisong Jin, Miaohui Zhang, Zhengtao Yu
- Abstract summary: A novel mixed-precision quantization method, termed CSMPQ, is proposed.
Specifically, the TF-IDF metric that is widely used in natural language processing (NLP) is introduced to measure the class separability of layer-wise feature maps.
Without any iterative process, the proposed CSMPQ achieves better compression trade-offs than the state-of-the-art quantization methods.
- Score: 9.005098065862411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixed-precision quantization has received increasing attention for its
capability of reducing the computational burden and speeding up inference.
Existing methods usually focus on the sensitivity of different network
layers, which requires a time-consuming search or training process. To this
end, a novel mixed-precision quantization method, termed CSMPQ, is proposed.
Specifically, the TF-IDF metric that is widely used in natural language
processing (NLP) is introduced to measure the class separability of layer-wise
feature maps. Furthermore, a linear programming problem is designed to derive
the optimal bit configuration for each layer. Without any iterative process,
the proposed CSMPQ achieves better compression trade-offs than the
state-of-the-art quantization methods. Specifically, CSMPQ achieves 73.03$\%$
top-1 accuracy on ResNet-18 with only 59G BOPs for QAT, and 71.30$\%$ top-1
accuracy with only 1.5Mb on MobileNetV2 for PTQ.
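The two steps named in the abstract, a TF-IDF-style separability score per layer followed by a budgeted bit assignment, can be illustrated with a small sketch. This is not the authors' formulation: the per-class activation counts, the function names `tfidf_scores` and `allocate_bits`, the objective (importance times bit-width), and the candidate bit-widths are all assumptions made for illustration. Exhaustive enumeration stands in for the linear program here, which is only practical for a handful of layers.

```python
import math
from itertools import product


def tfidf_scores(counts):
    """Standard TF-IDF, averaged over classes.

    counts[c][k] is a hypothetical statistic: how often feature k
    fires for class c. Classes play the role of documents and
    features the role of terms.
    """
    n_classes = len(counts)
    n_feats = len(counts[0])
    scores = []
    for k in range(n_feats):
        # Document frequency: number of classes where feature k fires at all.
        df = sum(1 for c in range(n_classes) if counts[c][k] > 0)
        idf = math.log(n_classes / df) if df else 0.0
        tf_sum = 0.0
        for c in range(n_classes):
            total = sum(counts[c])
            tf_sum += counts[c][k] / total if total else 0.0
        scores.append(idf * tf_sum / n_classes)
    return scores


def allocate_bits(importance, sizes, budget, choices=(2, 4, 8)):
    """Pick one bit-width per layer, maximizing sum(importance * bits)
    subject to sum(bits * size) <= budget.

    Brute force over all combinations (an assumption standing in for
    the linear program mentioned in the abstract).
    """
    best, best_val = None, float("-inf")
    for bits in product(choices, repeat=len(sizes)):
        cost = sum(b * s for b, s in zip(bits, sizes))
        if cost > budget:
            continue
        val = sum(w * b for w, b in zip(importance, bits))
        if val > best_val:
            best, best_val = bits, val
    return best


# A feature that fires for every class carries no separability signal.
scores = tfidf_scores([[4, 0], [4, 4]])

# Three equally sized layers under a budget of 1400 bits.
config = allocate_bits([0.9, 0.1, 0.5], [100, 100, 100], budget=1400)
```

In this toy setup the layer with the highest importance receives the widest bit-width that the budget allows, and a feature shared by all classes scores zero. For realistic layer counts an actual LP or integer-programming solver would replace the enumeration.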
Related papers
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization.
We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.
Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - Gradient-Based Post-Training Quantization: Challenging the Status Quo [23.1120983784623]
Quantization has become a crucial step for the efficient deployment of deep neural networks.
In this work, we show that the process is, to a certain extent, robust to a number of variables.
We derive a number of best practices for designing more efficient and scalable GPTQ methods.
arXiv Detail & Related papers (2023-08-15T09:25:11Z) - Mixed-Precision Quantization with Cross-Layer Dependencies [6.338965603383983]
Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off.
Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently.
We show that this assumption does not reflect the true behavior of quantized deep neural networks.
arXiv Detail & Related papers (2023-07-11T15:56:00Z) - Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective [74.48124653728422]
Post-training quantization (PTQ) is widely regarded as one of the most efficient compression methods in practice.
We argue that oscillation is an overlooked problem in PTQ methods.
arXiv Detail & Related papers (2023-03-21T14:52:52Z) - CADyQ: Content-Aware Dynamic Quantization for Image Super-Resolution [55.50793823060282]
We propose a novel Content-Aware Dynamic Quantization (CADyQ) method for image super-resolution (SR) networks.
CADyQ allocates optimal bits to local regions and layers adaptively based on the local contents of an input image.
The pipeline has been tested on various SR networks and evaluated on several standard benchmarks.
arXiv Detail & Related papers (2022-07-21T07:50:50Z) - SDQ: Stochastic Differentiable Quantization with Mixed Precision [46.232003346732064]
We present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy.
After the optimal MPQ strategy is acquired, we train our network with entropy-aware bin regularization and knowledge distillation.
SDQ outperforms all state-of-the-art mixed or single precision quantization methods with a lower bitwidth.
arXiv Detail & Related papers (2022-06-09T12:38:18Z) - OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z) - Optimal Qubit Mapping with Simultaneous Gate Absorption [9.530683922512873]
A key step in compilation is mapping the qubits in the program to physical qubits on a given quantum computer.
We present OLSQ-GA, an optimal qubit mapper with a key feature of simultaneous SWAP gate absorption.
OLSQ-GA reduces depth by up to 50.0% and SWAP count by 100% compared to other state-of-the-art methods.
arXiv Detail & Related papers (2021-09-14T05:15:36Z) - Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z) - BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization [32.770842274996774]
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks.
Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space.
This work proposes bit-level sparsity quantization (BSQ) to tackle the mixed-precision quantization from a new angle of inducing bit-level sparsity.
arXiv Detail & Related papers (2021-02-20T22:37:41Z)