Related papers: Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity

Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity

URL: http://arxiv.org/abs/2501.06357v1
Date: Fri, 10 Jan 2025 21:36:20 GMT
Title: Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity
Authors: Navin Ranjan, Andreas Savakis,
Abstract summary: Mix-QViT is an explainability-driven MPQ framework that allocates bit-widths to each layer based on two criteria.<n>For post-training quantization, we introduce a clipped channel-wise quantization method.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we propose Mix-QViT, an explainability-driven MPQ framework that systematically allocates bit-widths to each layer based on two criteria: layer importance, assessed via Layer-wise Relevance Propagation (LRP), which identifies how much each layer contributes to the final classification, and quantization sensitivity, determined by evaluating the performance impact of quantizing each layer at various precision levels while keeping others layers at a baseline. Additionally, for post-training quantization (PTQ), we introduce a clipped channel-wise quantization method designed to reduce the effects of extreme outliers in post-LayerNorm activations by removing severe inter-channel variations. We validate our approach by applying Mix-QViT to ViT, DeiT, and Swin Transformer models across multiple datasets. Our experimental results for PTQ demonstrate that both fixed-bit and mixed-bit methods outperform existing techniques, particularly at 3-bit, 4-bit, and 6-bit precision. Furthermore, in quantization-aware training, Mix-QViT achieves superior performance with 2-bit mixed-precision.

Related papers

MLoRQ: Bridging Low-Rank and Quantization for Transformer Compression [2.9907287985468924]
Mixed Low-Rank and Quantization (MLoRQ) is a novel method that integrates low-rank approximation and mixed-precision quantization.<n>MLoRQ shows state-of-the-art results with up to 15% performance improvement.
arXiv Detail & Related papers (2025-07-13T12:48:46Z)
Layer-wise Quantization for Quantized Optimistic Dual Averaging [75.4148236967503]
We develop a general layer-wise quantization framework with tight variance and code-length bounds, adapting to the heterogeneities over the course of training.<n>We propose a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs.
arXiv Detail & Related papers (2025-05-20T13:53:58Z)
Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model [0.0]
Mix-QSAM is a mixed-precision Post-Training Quantization (PTQ) framework for the Segment Anything Model (SAM)<n>We introduce a layer-wise importance score, derived using Kullback-Leibler (KL) divergence, to quantify each layer's contribution to the model's output.<n>We also introduce cross-layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers.
arXiv Detail & Related papers (2025-05-08T00:08:31Z)
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs)<n>We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group-wise.<n> Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
LRP-QViT: Mixed-Precision Vision Transformer Quantization via Layer-wise Relevance Propagation [0.0]
We introduce LRP-QViT, an explainability-based method for assigning mixed-precision bit allocations to different layers based on their importance during classification. Our experimental findings demonstrate that both our fixed-bit and mixed-bit post-training quantization methods surpass existing models in the context of 4-bit and 6-bit quantization.
arXiv Detail & Related papers (2024-01-20T14:53:19Z)
Mixed-Precision Quantization with Cross-Layer Dependencies [6.338965603383983]
Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off. Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently. We show that this assumption does not reflect the true behavior of quantized deep neural networks.
arXiv Detail & Related papers (2023-07-11T15:56:00Z)
CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification [51.81850995661478]
Mixed-precision quantization has been widely applied on deep neural networks (DNNs) Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence. We propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability.
arXiv Detail & Related papers (2022-12-06T05:44:21Z)
On Robust Learning from Noisy Labels: A Permutation Layer Approach [53.798757734297986]
This paper introduces a permutation layer learning approach termed PermLL to dynamically calibrate the training process of a deep neural network (DNN) We provide two variants of PermLL in this paper: one applies the permutation layer to the model's prediction, while the other applies it directly to the given noisy label. We validate PermLL experimentally and show that it achieves state-of-the-art performance on both real and synthetic datasets.
arXiv Detail & Related papers (2022-11-29T03:01:48Z)
Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training. We show that the resulting problem is strongly dual and does away with gradient estimations. We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
arXiv Detail & Related papers (2022-10-27T17:12:48Z)
BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation. Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration. This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors. Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications. Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors. The optimal local precision settings are automatically learned using two techniques. Experiments conducted on Penn Treebank (PTB) and a Switchboard corpus trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
PTQ-SL: Exploring the Sub-layerwise Post-training Quantization [6.0070278366995105]
Network quantization is a powerful technique to compress convolutional neural networks. The quantization granularity determines how to share the scaling factors in weights. We propose an efficient post-training quantization method in sub-layerwise granularity (PTQ-SL)
arXiv Detail & Related papers (2021-10-15T02:18:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.