Related papers: QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI Applications

QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI Applications

URL: http://arxiv.org/abs/2501.07161v1
Date: Mon, 13 Jan 2025 09:41:54 GMT
Title: QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI Applications
Authors: Jeongseok Kim, Jemin Lee, Yongin Kwon, Daeyoung Kim,
Abstract summary: QuantuneV2 is a compiler-based mixed-precision quantization method for practical embedded AI applications.<n>We show that QuantuneV2 achieved up to a 10.28 percent improvement in accuracy and a 12.52 percent increase in speed compared to existing methods.
Score: 14.388990959056962
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Mixed-precision quantization methods have been proposed to reduce model size while minimizing accuracy degradation. However, existing studies require retraining and do not consider the computational overhead and intermediate representations (IR) generated during the compilation process, limiting their application at the compiler level. This computational overhead refers to the runtime latency caused by frequent quantization and dequantization operations during inference. Performing these operations at the individual operator level causes significant runtime delays. To address these issues, we propose QuantuneV2, a compiler-based mixed-precision quantization method designed for practical embedded AI applications. QuantuneV2 performs inference only twice, once before quantization and once after quantization, and operates with a computational complexity of O(n) that increases linearly with the number of model parameters. We also made the sensitivity analysis more stable by using local metrics like weights, activation values, the Signal to Quantization Noise Ratio, and the Mean Squared Error. We also cut down on computational overhead by choosing the best IR and using operator fusion. Experimental results show that QuantuneV2 achieved up to a 10.28 percent improvement in accuracy and a 12.52 percent increase in speed compared to existing methods across five models: ResNet18v1, ResNet50v1, SqueezeNetv1, VGGNet, and MobileNetv2. This demonstrates that QuantuneV2 enhances model performance while maintaining computational efficiency, making it suitable for deployment in embedded AI environments.

Related papers

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs [4.946856266233001]
SignRoundV2 is a post-training quantization framework that is highly effective even without mixed-precision.<n>Our method sustains competitive accuracy for Large Language Models, achieving production-grade performance with about 1 percent variance at 4-5 bits.
arXiv Detail & Related papers (2025-12-04T12:35:10Z)
Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference [3.7687375904925484]
We propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation.<n>We develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead.
arXiv Detail & Related papers (2025-05-20T17:26:12Z)
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs [10.036727981085223]
SplitQuantV2 is an innovative algorithm designed to enhance low-bit linear quantization of large language models. It can achieve results comparable to those of advanced algorithms.
arXiv Detail & Related papers (2025-03-07T14:59:07Z)
Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks [10.229120811024162]
deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization. We propose a novel methodology to apply them jointly via a lightweight gradient-based search.
arXiv Detail & Related papers (2024-07-01T08:07:02Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models [9.727062803700264]
We introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication. LUT-GEMM eliminates the resource-intensive dequantization process and reduces computational costs. We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency.
arXiv Detail & Related papers (2022-06-20T03:48:17Z)
Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization [35.198615417316056]
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T) We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance.
arXiv Detail & Related papers (2022-06-16T02:17:49Z)
OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization. We propose to optimize a proxy metric, the concept of networkity, which is highly correlated with the loss of the integer programming. This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models. We show negligible WER change as compared to the full-precision baseline models. Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation. Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
Accelerating Neural Network Inference by Overflow Aware Quantization [16.673051600608535]
Inherited heavy computation of deep neural networks prevents their widespread applications. We propose an overflow aware quantization method by designing trainable adaptive fixed-point representation. With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance.
arXiv Detail & Related papers (2020-05-27T11:56:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.