Related papers: BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization

URL: http://arxiv.org/abs/2102.10462v1
Date: Sat, 20 Feb 2021 22:37:41 GMT
Title: BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization
Authors: Huanrui Yang, Lin Duan, Yiran Chen, Hai Li
Abstract summary: Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from a new angle: inducing bit-level sparsity.
Score: 32.770842274996774
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and thus has been widely investigated. However, there is no systematic method to determine the exact quantization scheme. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. These approaches cannot lead to an optimal quantization scheme efficiently. This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from a new angle: inducing bit-level sparsity. We consider each bit of the quantized weights as an independent trainable variable and introduce a differentiable bit-sparsity regularizer. BSQ can induce all-zero bits across a group of weight elements and realize dynamic precision reduction, leading to a mixed-precision quantization scheme for the original model. Our method enables the exploration of the full mixed-precision space with a single gradient-based optimization process, with only one hyperparameter to trade off performance and compression. BSQ achieves both higher accuracy and higher bit reduction on various model architectures on the CIFAR-10 and ImageNet datasets compared to previous methods.
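
The bit-level idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the restriction to non-negative integer weights, and the group-lasso-style penalty (one L2 norm per bit plane) are assumptions made for illustration only. The sketch shows how weights split into bit planes, how a penalty over whole planes can be computed, and how all-zero planes at either end translate into a lower effective precision for the group.

```python
import numpy as np

def to_bit_planes(w_int, n_bits):
    """Split non-negative integer weights into n_bits planes, LSB first."""
    w_int = np.asarray(w_int, dtype=np.int64)
    planes = [(w_int >> b) & 1 for b in range(n_bits)]
    return np.stack(planes).astype(np.float64)

def from_bit_planes(planes):
    """Recombine (possibly relaxed, real-valued) bit planes into weights."""
    scales = 2.0 ** np.arange(planes.shape[0])
    return np.tensordot(scales, planes, axes=1)

def bit_sparsity_penalty(planes):
    """Assumed group-lasso-style penalty: sum of per-plane L2 norms.

    Driving a whole plane's norm to zero removes that bit for the group.
    """
    flat = planes.reshape(planes.shape[0], -1)
    return float(np.linalg.norm(flat, axis=1).sum())

def effective_bits(planes, tol=1e-8):
    """Bits still needed after trimming all-zero planes at both ends."""
    flat = np.abs(planes.reshape(planes.shape[0], -1))
    used = np.flatnonzero(flat.max(axis=1) > tol)
    if used.size == 0:
        return 0
    return int(used[-1] - used[0] + 1)

if __name__ == "__main__":
    # Hypothetical weight group stored in 4 bits but only using values 0..3.
    w = np.array([0, 1, 2, 3, 2, 1, 0, 3])
    planes = to_bit_planes(w, n_bits=4)
    print(from_bit_planes(planes))       # reconstructs the original weights
    print(bit_sparsity_penalty(planes))  # penalty over the four planes
    print(effective_bits(planes))        # two upper planes are all zero
```

In a training loop, a penalty of this kind would be added to the task loss with a single weighting coefficient, which matches the abstract's description of one hyperparameter controlling the tradeoff between performance and compression.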

Related papers

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search [7.564770908909927]
Quantization is a technique for creating efficient Deep Neural Networks (DNNs). We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error. We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone.
arXiv Detail & Related papers (2023-09-29T15:49:54Z)
Mixed-Precision Quantization for Deep Vision Models with Integer Quadratic Programming [7.0146264551420066]
Quantization is a widely used technique to compress neural networks. Mixed-precision quantization (MPQ) assigns varied bit-widths to layers, optimizing the accuracy-efficiency trade-off. We introduce CLADO, a practical sensitivity-based MPQ algorithm that captures the cross-layer dependency of quantization error.
arXiv Detail & Related papers (2023-07-11T15:56:00Z)
CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification [51.81850995661478]
Mixed-precision quantization has been widely applied to deep neural networks (DNNs). Previous attempts at bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence. We propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability.
arXiv Detail & Related papers (2022-12-06T05:44:21Z)
SDQ: Stochastic Differentiable Quantization with Mixed Precision [46.232003346732064]
We present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the mixed-precision quantization (MPQ) strategy. After the optimal MPQ strategy is acquired, we train our network with entropy-aware bin regularization and knowledge distillation. SDQ outperforms all state-of-the-art mixed- or single-precision quantization methods with a lower bitwidth.
arXiv Detail & Related papers (2022-06-09T12:38:18Z)
Post-training Quantization for Neural Networks with Provable Guarantees [9.58246628652846]
We modify a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. We prove that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights.
arXiv Detail & Related papers (2022-01-26T18:47:38Z)
Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. These models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices. We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors. Novel mixed-precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks. DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons. We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
Effective and Fast: A Novel Sequential Single Path Search for Mixed-Precision Quantization [45.22093693422085]
A mixed-precision quantization model can match different quantization bit-precisions to the sensitivity of different layers to achieve great performance. However, quickly determining the quantization bit-precision of each layer in a deep neural network under some constraints is a difficult problem. We propose a novel sequential single path search (SSPS) method for mixed-precision quantization.
arXiv Detail & Related papers (2021-03-04T09:15:08Z)
Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators. We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differentiable method to search them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.