Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
- URL: http://arxiv.org/abs/2407.15508v3
- Date: Thu, 15 May 2025 05:34:45 GMT
- Title: Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
- Authors: Yifei Gao, Jie Ou, Lei Wang, Jun Cheng, Mengchu Zhou
- Abstract summary: We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
- Score: 51.32182730502002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quantization of large language models (LLMs) has been a prominent research area aimed at enabling their lightweight deployment in practice. Existing research on LLM quantization has mainly explored the interplay between weights and activations, or employed auxiliary components, while neglecting the need to adjust weights during quantization. Consequently, original weight distributions frequently fail to yield desired results after round-to-nearest (RTN) quantization. Although techniques such as mixed precision and low-rank error approximation can yield improved results, they inevitably introduce additional computational overhead. On the other hand, traditional techniques for weight quantization, such as Generative Post-Training Quantization, rely on manually tweaking weight distributions to minimize local errors, but they fall short of achieving globally optimal outcomes. Although the recently proposed Learnable Singular-value Increment improves global weight quantization by modifying weight distributions, it disrupts the original distribution considerably, introducing a pronounced bias toward the training data that can degrade downstream task performance. In this paper, we introduce Singular-value Diagonal Expansion, a more nuanced approach to refining weight distributions for better quantization alignment. Furthermore, we introduce Cross-layer Learning, which improves overall quantization outcomes by distributing errors more evenly across layers. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches, including OmniQuant, DuQuant, and PrefixQuant.
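As a concrete point of reference for the abstract above, the sketch below shows plain round-to-nearest (RTN) weight quantization and the general shape of an SVD-based refinement in which a weight is rebuilt from adjusted singular values before quantizing. The adjustment rule, tensor sizes, and function names are illustrative assumptions, not the paper's Singular-value Diagonal Expansion implementation.

```python
# Illustrative sketch, not the paper's code: symmetric round-to-nearest (RTN)
# weight quantization, plus an SVD-based refinement in which W = U diag(s) Vh
# is rebuilt from adjusted singular values before quantizing. The adjustment
# rule below is a placeholder assumption.
import torch

def rtn_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor round-to-nearest quantization (dequantized)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

def svd_refine_then_quantize(w: torch.Tensor, n_bits: int = 4,
                             eps: float = 1e-2) -> torch.Tensor:
    """Nudge singular values (stand-in for a learned adjustment), then RTN."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    w_refined = u @ torch.diag(s * (1.0 + eps)) @ vh
    return rtn_quantize(w_refined, n_bits)

w = torch.randn(256, 256)
print((rtn_quantize(w) - w).abs().mean().item())            # plain RTN error
print((svd_refine_then_quantize(w) - w).abs().mean().item())
```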
Related papers
- Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models [1.4999444543328293]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks.
This paper investigates the quantization of LLMs, focusing on the LLaMA architecture and its derivatives.
We propose a novel mixed-precision quantization approach tailored for LLaMA-like models.
arXiv Detail & Related papers (2025-04-30T11:52:18Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
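For readers unfamiliar with the building block RoSTE relies on, here is a minimal straight-through-estimator (STE) sketch for quantization-aware fine-tuning. The fixed random orthogonal matrix merely stands in for "a rotation"; RoSTE's adaptive rotation strategy and QA-SFT objective are not reproduced.

```python
# Minimal STE sketch: quantize in the forward pass, pass gradients through
# unchanged in the backward pass. Rotation here is a fixed orthogonal matrix
# used only for illustration.
import torch

class STEQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through: treat quantization as identity in the backward pass.
        return grad_out, None

w = torch.nn.Parameter(torch.randn(64, 64))
rot, _ = torch.linalg.qr(torch.randn(64, 64))    # fixed orthogonal "rotation"
loss = (STEQuant.apply(w @ rot, 4) ** 2).mean()  # toy objective
loss.backward()                                  # gradients reach w via the STE
print(w.grad.norm().item())
```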
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Leveraging Pre-Trained Neural Networks to Enhance Machine Learning with Variational Quantum Circuits [48.33631905972908]
We introduce an innovative approach that utilizes pre-trained neural networks to enhance Variational Quantum Circuits (VQC).
This technique effectively separates approximation error from qubit count and removes the need for restrictive conditions.
Our results extend to applications such as human genome analysis, demonstrating the broad applicability of our approach.
arXiv Detail & Related papers (2024-11-13T12:03:39Z) - A Comprehensive Study on Quantization Techniques for Large Language Models [0.0]
Large Language Models (LLMs) have been extensively researched and used in both academia and industry.
LLMs present significant challenges for deployment on resource-constrained IoT devices and embedded systems.
Quantization, a technique that reduces the precision of model values to a smaller set of discrete values, offers a promising solution.
arXiv Detail & Related papers (2024-10-30T04:55:26Z) - IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
IntLoRA adapts quantized diffusion models with integer-type low-rank parameters to include inference efficiency during tuning. During inference, IntLoRA weights can be seamlessly merged into pre-trained weights to directly obtain quantized downstream weights without PTQ.
arXiv Detail & Related papers (2024-10-29T05:50:17Z) - Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview [4.166341398835636]
We discuss the necessity and impact of model size growth, highlighting the performance benefits as well as the computational challenges and environmental considerations.
We delve into various quantization techniques, including both post-training quantization (PTQ) and quantization-aware training (QAT).
We examine how these methods address issues like outliers, importance weighting, and activation quantization, ultimately contributing to more sustainable and accessible deployment of large-scale models.
arXiv Detail & Related papers (2024-09-18T02:35:00Z) - Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z) - LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices [41.17378536966264]
Low-Rank Quantization (LRQ) reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices. Thanks to parameter sharing via the low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights. We show the superiority of LRQ over prior LLM PTQ works under (i) 8-bit weight and per-tensor activation quantization, (ii) 4-bit weight and 8-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes.
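A minimal sketch of the low-rank weight-scaling idea, assuming an element-wise scale parameterized by two small factors; LRQ's block-output reconstruction objective and exact parametrization are not reproduced.

```python
# Sketch: an element-wise scaling matrix built from two small factors, so far
# fewer values are learned than a full per-element scale.
import torch

out_dim, in_dim, rank = 512, 512, 8
w = torch.randn(out_dim, in_dim)
left = torch.nn.Parameter(torch.zeros(out_dim, rank))
right = torch.nn.Parameter(torch.zeros(rank, in_dim))

scale = 1.0 + left @ right      # low-rank parameterized element-wise scales
w_scaled = w * scale            # out_dim*rank + rank*in_dim learnable values
# w_scaled would then be quantized (e.g. RTN) and the factors trained to
# minimize the quantized block's output reconstruction error.
print(scale.shape, w_scaled.shape)
```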
arXiv Detail & Related papers (2024-07-16T09:32:07Z) - Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities. Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z) - Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other [10.292252814921714]
We introduce Learnable Singular value Increment (LSI) as an advanced solution to quantization problems.
LSI uses Singular Value Decomposition to extract singular values of the weights and make them learnable to help weights compensate each other conditioned on activation.
We achieve state-of-the-art performance in diverse quantization settings, whether in weight-only, weight-activation, or extremely low-bit scenarios.
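A minimal sketch of the LSI idea as summarized above: decompose a weight with SVD, make the singular values learnable, and rebuild the weight before quantization. The squared-error objective is a placeholder and activation conditioning is omitted; this is not the authors' implementation.

```python
# Sketch: SVD the weight, learn increments on the singular values, rebuild.
import torch

w = torch.randn(256, 256)
u, s, vh = torch.linalg.svd(w, full_matrices=False)
s_inc = torch.nn.Parameter(torch.zeros_like(s))      # learnable increments

def reconstruct() -> torch.Tensor:
    # Weight rebuilt from the (learnably) adjusted singular values.
    return u @ torch.diag(s + s_inc) @ vh

loss = (reconstruct() - w).pow(2).mean()              # placeholder objective
loss.backward()                                       # s_inc receives gradients
print(s_inc.grad.abs().sum().item())
```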
arXiv Detail & Related papers (2024-06-24T03:52:52Z) - Investigating the Impact of Quantization on Adversarial Robustness [22.637585106574722]
Quantization is a technique for reducing the bit-width of deep models to improve their runtime performance and storage efficiency.
In real-world scenarios, quantized models are often faced with adversarial attacks which cause the model to make incorrect inferences.
We conduct a first-time analysis of the impact of the quantization pipeline components that can incorporate robust optimization.
arXiv Detail & Related papers (2024-04-08T16:20:15Z) - What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation [55.153595212571375]
Quantization is a technique for improving the memory and computational efficiency of large language models (LLMs).
We propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs.
We conduct experiments with various artificial perturbations to explore their impact on LLM performance.
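A minimal sketch of the perturbation viewpoint, assuming Gaussian noise injected into a single linear layer's weights as the artificial perturbation; the paper's actual perturbation designs and LLM benchmarks are not reproduced.

```python
# Sketch: inject noise of increasing magnitude into a layer's weights and
# watch the output drift, mimicking the "quantization as perturbation" view.
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(512, 512)
x = torch.randn(32, 512)
ref = layer(x)

for std in (0.0, 0.01, 0.05, 0.1):
    noisy_w = layer.weight + std * torch.randn_like(layer.weight)
    out = torch.nn.functional.linear(x, noisy_w, layer.bias)
    print(f"std={std}  output MSE={(out - ref).pow(2).mean().item():.6f}")
```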
arXiv Detail & Related papers (2024-03-11T03:42:51Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models [0.18416014644193066]
AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization.
We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model.
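A minimal sketch of activation-weight equalization in the general SmoothQuant style, where per-channel scales shift quantization difficulty from activations to weights while leaving the layer output unchanged; AWEQ's exact scale rule and bias-correction refinement are not reproduced.

```python
# Sketch: per-channel scales move activation outliers into the weights; the
# layer output is mathematically unchanged because (x/s) @ (w*s)^T == x @ w^T.
import torch

x = torch.randn(64, 512) * torch.linspace(0.1, 10.0, 512)  # outlier channels
w = torch.randn(256, 512)

act_max = x.abs().amax(dim=0)                    # per input channel
wgt_max = w.abs().amax(dim=0)
s = (act_max / wgt_max).sqrt().clamp(min=1e-5)   # one common equalization rule

x_eq, w_eq = x / s, w * s
print((x @ w.t() - x_eq @ w_eq.t()).abs().max().item())   # ~0: output preserved
```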
arXiv Detail & Related papers (2023-11-02T15:18:22Z) - PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for Large Language Model (LLM) compression.
We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
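A minimal sketch of partial binarization, assuming a simple magnitude criterion for which weights stay in full precision; PB-LLM's salient-weight selection and subsequent optimization are not reproduced.

```python
# Sketch: keep a small salient fraction of weights in full precision and
# binarize the rest to a scaled sign.
import torch

def partially_binarize(w: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    k = max(1, int(keep_ratio * w.numel()))
    thresh = w.abs().flatten().topk(k).values.min()
    salient = w.abs() >= thresh                 # kept at full precision
    alpha = w[~salient].abs().mean()            # per-tensor binary scale
    return torch.where(salient, w, alpha * torch.sign(w))

w = torch.randn(256, 256)
print((partially_binarize(w) - w).abs().mean().item())   # binarization error
```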
arXiv Detail & Related papers (2023-09-29T14:35:27Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language
Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Where Should We Begin? A Low-Level Exploration of Weight Initialization Impact on Quantized Behaviour of Deep Neural Networks [93.4221402881609]
We present an in-depth, fine-grained ablation study of the effect of different weight initializations on the final distributions of weights and activations across different CNN architectures.
To the best of our knowledge, we are the first to perform such a low-level, in-depth quantitative analysis of weight initialization and its effect on quantized behaviour.
arXiv Detail & Related papers (2020-11-30T06:54:28Z)