AutoQNN: An End-to-End Framework for Automatically Quantizing Neural
Networks
- URL: http://arxiv.org/abs/2304.03782v1
- Date: Fri, 7 Apr 2023 11:14:21 GMT
- Title: AutoQNN: An End-to-End Framework for Automatically Quantizing Neural
Networks
- Authors: Cheng Gong, Ye Lu, Surong Dai, Deng Qian, Chenkun Du, Tao Li
- Abstract summary: We propose an end-to-end framework named AutoQNN, for automatically quantizing different layers utilizing different schemes and bitwidths without any human labor.
QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes.
QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention.
- Score: 6.495218751128902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploring a suitable quantizing scheme together with a mixed-precision
policy is the key to compressing deep neural networks (DNNs) with high efficiency
and accuracy. This exploration imposes a heavy workload on domain experts, so an
automatic compression method is needed. However, the huge search space of such
automatic methods demands a large computing budget, which makes them challenging
to apply in real scenarios. In this paper, we propose an end-to-end framework
named AutoQNN for automatically quantizing different layers with different schemes
and bitwidths, without any human labor. AutoQNN can efficiently seek desirable
quantizing schemes and mixed-precision policies for mainstream DNN models by
combining three techniques: quantizing scheme search (QSS), quantizing precision
learning (QPL), and quantized architecture generation (QAG). QSS introduces five
quantizing schemes and defines three new schemes as the candidate set for scheme
search, and then uses a differentiable neural architecture search (DNAS) algorithm
to seek the desired scheme for each layer or the whole model from this set. QPL is,
to the best of our knowledge, the first method to learn mixed-precision policies by
reparameterizing the bitwidths of quantizing schemes. QPL efficiently optimizes
both the classification loss and the precision loss of DNNs and obtains a
near-optimal mixed-precision model within a limited model size and memory
footprint. QAG is designed to convert arbitrary architectures into the
corresponding quantized ones without manual intervention, to facilitate end-to-end
neural network quantization. We have implemented AutoQNN and integrated it into
Keras. Extensive experiments demonstrate that AutoQNN consistently outperforms
state-of-the-art quantization methods.
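To make the components more concrete, the following is a minimal NumPy sketch of the ideas behind QSS and QPL: the discrete choice among candidate quantizers is relaxed into a softmax-weighted mixture (a DNAS-style relaxation), and the bitwidth is treated as a continuous, learnable parameter that is rounded when applied. The candidate schemes, the mixture details, and the precision-loss comment are illustrative assumptions, not the authors' Keras implementation.

```python
import numpy as np

def uniform_quantize(w, bits):
    """Symmetric uniform quantizer: round w onto a (2**bits - 1)-level grid."""
    half = (2 ** bits - 1) / 2.0
    scale = np.max(np.abs(w)) + 1e-8
    return np.round(w / scale * half) / half * scale

def binarize(w):
    """Sign quantizer scaled by the mean magnitude (a common 1-bit scheme)."""
    return np.sign(w) * np.mean(np.abs(w))

def qss_mixture(w, alphas, bits):
    """QSS-style relaxation: softmax-weighted sum over candidate quantizers.

    `alphas` are learnable architecture logits, one per candidate scheme; the
    mixture is used during search, and the argmax scheme is kept afterwards.
    """
    candidates = [uniform_quantize(w, bits), binarize(w), w]  # placeholder set
    probs = np.exp(alphas - np.max(alphas))
    probs /= probs.sum()
    return sum(p * q for p, q in zip(probs, candidates))

def qpl_quantize(w, soft_bits):
    """QPL-style idea: bitwidth as a continuous parameter, rounded when applied.

    A precision loss (e.g. proportional to w.size * soft_bits) can be added to
    the task loss so that training trades accuracy against model size.
    """
    bits = int(np.clip(np.round(soft_bits), 1, 8))
    return uniform_quantize(w, bits), bits

if __name__ == "__main__":
    w = np.random.randn(64, 64).astype(np.float32)
    mixed = qss_mixture(w, alphas=np.array([0.2, -0.1, 0.0]), bits=4)
    q, b = qpl_quantize(w, soft_bits=3.7)
    print("mixture error:", np.mean((w - mixed) ** 2), "| chosen bits:", b)
```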
Related papers
- AdaQAT: Adaptive Bit-Width Quantization-Aware Training [0.873811641236639]
Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios.
Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging.
We present Adaptive Bit-Width Quantization-Aware Training (AdaQAT), a learning-based method that automatically optimizes bit-widths during training for more efficient inference.
arXiv Detail & Related papers (2024-04-22T09:23:56Z) - Optimizing Quantum Convolutional Neural Network Architectures for Arbitrary Data Dimension [2.9396076967931526]
Quantum convolutional neural networks (QCNNs) represent a promising approach in quantum machine learning.
We propose a QCNN architecture capable of handling arbitrary input data dimensions while optimizing the allocation of quantum resources.
arXiv Detail & Related papers (2024-03-28T02:25:12Z) - Adaptive quantization with mixed-precision based on low-cost proxy [8.527626602939105]
This paper proposes a novel model quantization method named Low-Cost Proxy-Based Adaptive Mixed-Precision Model Quantization (LCPAQ).
Its hardware-aware module is designed with hardware limitations in mind, while an adaptive mixed-precision quantization module evaluates quantization sensitivity.
Experiments on ImageNet demonstrate that the proposed LCPAQ achieves quantization accuracy comparable or superior to existing mixed-precision models.
arXiv Detail & Related papers (2024-02-27T17:36:01Z) - QVIP: An ILP-based Formal Verification Approach for Quantized Neural
Networks [14.766917269393865]
Quantization has emerged as a promising technique for reducing the size of neural networks while retaining accuracy comparable to their floating-point counterparts.
We propose a novel and efficient formal verification approach for QNNs.
In particular, we are the first to propose an encoding that reduces the verification problem of QNNs to solving integer linear constraints.
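As a rough, self-contained illustration of such a reduction (not QVIP's actual encoding), the sketch below uses the open-source pulp MILP library to ask whether a single toy neuron with integer weights, integer inputs in [0, 255], and a ReLU can ever exceed a given output bound; the weights, the bound, and the big-M ReLU encoding are illustrative assumptions. A full encoding would also have to model the rounding and clamping inside each quantized layer, which this toy example omits.

```python
import pulp

# Toy quantized neuron: integer weights, integer inputs in [0, 255], ReLU output.
# Verification question: can relu(w . x + b) exceed the bound T for any input?
w, b, T = [3, -2, 1], -40, 1000

prob = pulp.LpProblem("qnn_output_bound", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", lowBound=0, upBound=255, cat="Integer")
     for i in range(len(w))]

# Pre-activation z = w . x + b and its interval bounds (used as big-M constants).
z = pulp.lpSum(wi * xi for wi, xi in zip(w, x)) + b
lo = sum(min(0, wi * 255) for wi in w) + b   # most negative z can be
hi = sum(max(0, wi * 255) for wi in w) + b   # most positive z can be

# Standard big-M encoding of y = relu(z) with a binary indicator a.
y = pulp.LpVariable("y", lowBound=0)
a = pulp.LpVariable("a", cat="Binary")
prob += y >= z                   # y is at least the pre-activation
prob += y <= z - lo * (1 - a)    # if a = 1, y <= z (so y = z >= 0)
prob += y <= hi * a              # if a = 0, y <= 0 (so y = 0 and z <= 0)

prob += 1 * y                    # objective: maximize the neuron's output
prob.solve(pulp.PULP_CBC_CMD(msg=False))

max_y = pulp.value(prob.objective)
print("max output:", max_y, "| bound", T, "holds:", max_y <= T)
```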
arXiv Detail & Related papers (2022-12-10T03:00:29Z) - Q-SpiNN: A Framework for Quantizing Spiking Neural Networks [14.727296040550392]
Quantization is a prominent technique for reducing the memory footprint of Spiking Neural Networks (SNNs) without significantly decreasing accuracy.
We propose Q-SpiNN, a novel quantization framework for memory-efficient SNNs.
For the unsupervised network, Q-SpiNN reduces the memory footprint by ca. 4x while keeping the accuracy within 1% of the baseline on the MNIST dataset.
For the supervised network, Q-SpiNN reduces the memory by ca. 2x while keeping the accuracy within 2% of the baseline on the DVS-Gesture dataset.
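The ~4x figure matches what simple bit-counting predicts when 32-bit weights are stored with 8 bits; the snippet below is just that back-of-the-envelope arithmetic, with a made-up layer size and bitwidths rather than Q-SpiNN's actual configuration.

```python
# Back-of-the-envelope weight-memory arithmetic (illustrative numbers only,
# not Q-SpiNN's actual layer sizes or bitwidths).
num_weights = 784 * 400                 # e.g. one fully connected SNN layer
baseline_bits, quant_bits = 32, 8

baseline_mb = num_weights * baseline_bits / 8 / 1e6
quant_mb = num_weights * quant_bits / 8 / 1e6
print(f"{baseline_mb:.2f} MB -> {quant_mb:.2f} MB "
      f"({baseline_bits // quant_bits}x smaller)")
```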
arXiv Detail & Related papers (2021-07-05T06:01:15Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
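The underlying identity can be sketched as follows: a k-bit integer code q in {0, ..., 2^k - 1} can be written as q = 0.5 * sum_i 2^i * s_i + (2^k - 1)/2 with every s_i in {-1, +1}, so each bit plane becomes a binary branch that can reuse XNOR-style kernels. The NumPy snippet below checks this identity; it is a hedged illustration of the decomposition idea, not the paper's exact scheme.

```python
import numpy as np

def decompose_pm1(q_int, bits):
    """Write integer codes q in {0, ..., 2**bits - 1} as a weighted sum of
    {-1, +1} tensors:  q = 0.5 * sum_i 2**i * s_i + (2**bits - 1) / 2."""
    branches = []
    for i in range(bits):
        b_i = (q_int >> i) & 1          # i-th bit plane in {0, 1}
        branches.append(2 * b_i - 1)    # map to {-1, +1}
    return branches

if __name__ == "__main__":
    bits = 2
    q = np.random.randint(0, 2 ** bits, size=(4, 4))   # quantized weight codes
    s = decompose_pm1(q, bits)
    recon = 0.5 * sum((2 ** i) * s_i for i, s_i in enumerate(s)) + (2 ** bits - 1) / 2
    assert np.array_equal(recon, q)     # exact reconstruction from binary branches
    print(f"{bits}-bit codes decomposed into {len(s)} binary {{-1,+1}} branches")
```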
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - Decentralizing Feature Extraction with Quantum Convolutional Neural
Network for Automatic Speech Recognition [101.69873988328808]
We build upon a quantum convolutional neural network (QCNN) composed of a quantum circuit encoder for feature extraction.
The input speech is first up-streamed to a quantum computing server to extract the Mel-spectrogram.
The corresponding convolutional features are encoded using a quantum circuit algorithm with random parameters.
The encoded features are then down-streamed to the local RNN model for the final recognition.
arXiv Detail & Related papers (2020-10-26T03:36:01Z) - Once Quantization-Aware Training: High Performance Extremely Low-bit
Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of both sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
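A rough sketch of the two ingredients (shared step size and bit-inheritance) under stated assumptions: weights are quantized uniformly with one trainable step size, and when a model is transferred to a lower bitwidth the child inherits that step size, rescaled so the representable range stays roughly the same. The rescaling rule and the numbers below are illustrative, not the paper's implementation.

```python
import numpy as np

def quantize(w, step, bits):
    """Uniform symmetric quantization with a given step size."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / step), -qmax, qmax) * step

def inherit_step(step, bits_from, bits_to):
    """Bit-inheritance heuristic (assumption): keep the representable range
    roughly constant, so the coarser grid gets a proportionally larger step."""
    return step * (2 ** (bits_from - 1) - 1) / (2 ** (bits_to - 1) - 1)

if __name__ == "__main__":
    w = np.random.randn(1000).astype(np.float32)
    step8 = np.std(w) / 32          # a shared step size (learned during QAT)
    for bits in (8, 6, 4):
        step = inherit_step(step8, 8, bits)
        err = np.mean((w - quantize(w, step, bits)) ** 2)
        print(f"{bits}-bit, step={step:.4f}, mse={err:.5f}")
```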
arXiv Detail & Related papers (2020-10-09T03:52:16Z) - MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach to reduce the search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z) - APQ: Joint Search for Network Architecture, Pruning and Quantization
Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z) - Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)