Why Do Some Inputs Break Low-Bit LLM Quantization?
- URL: http://arxiv.org/abs/2506.12044v1
- Date: Sat, 24 May 2025 16:17:50 GMT
- Title: Why Do Some Inputs Break Low-Bit LLM Quantization?
- Authors: Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
- Abstract summary: Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs). We analyze diverse 3-4 bit methods on LLMs ranging from 7B-70B in size and find that the quantization errors of 50 pairs of methods are strongly correlated (avg. 0.82) on FineWeb examples.
- Score: 27.428207255250676
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but disproportionately affects certain examples. We analyze diverse 3-4 bit methods on LLMs ranging from 7B-70B in size and find that the quantization errors of 50 pairs of methods are strongly correlated (avg. 0.82) on FineWeb examples. Moreover, the residual stream magnitudes of full-precision models are indicative of future quantization errors. We further establish a hypothesis that relates the residual stream magnitudes to error amplification and accumulation over layers. Using LLM localization techniques, early exiting, and activation patching, we show that examples with large errors rely on precise residual activations in the late layers, and that the outputs of MLP gates play a crucial role in maintaining the perplexity. Our work reveals why certain examples result in large quantization errors and which model components are most critical for performance preservation.
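The core measurement behind this analysis can be approximated in a few lines. Below is a minimal sketch, assuming a Hugging Face causal LM and the bitsandbytes NF4 backend (placeholders rather than the paper's exact setup), that computes a per-example quantization error as the gap in token-averaged loss between the full-precision and quantized model and records the full-precision residual-stream magnitude that the abstract describes as indicative of that error.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM in the 7B-70B range
tok = AutoTokenizer.from_pretrained(name)
fp_model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
q_model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    device_map="auto",
)

def token_avg_loss(model, ids):
    """Mean next-token cross-entropy of one example under `model`."""
    with torch.no_grad():
        return model(ids.to(model.device), labels=ids.to(model.device)).loss.item()

def residual_norm(model, ids, layer=-1):
    """Mean L2 norm of the residual-stream (hidden-state) vectors at `layer`."""
    with torch.no_grad():
        hs = model(ids.to(model.device), output_hidden_states=True).hidden_states[layer]
    return hs.float().norm(dim=-1).mean().item()

errors, norms = [], []
for text in ["example document 1", "example document 2"]:  # e.g. FineWeb snippets
    ids = tok(text, return_tensors="pt").input_ids
    errors.append(token_avg_loss(q_model, ids) - token_avg_loss(fp_model, ids))
    norms.append(residual_norm(fp_model, ids))
# A strong positive correlation between `norms` and `errors` would mirror the paper's finding.
```

Repeating the error computation under several quantization methods (GPTQ, AWQ, etc.) and correlating the per-example errors across method pairs would correspond to the pairwise correlation analysis reported in the abstract.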
Related papers
- Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. However, LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. The new frontier for LLM quantization is 4-bit precision, which results in an average memory footprint reduction of 70%.
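As a rough sanity check on that figure, here is a back-of-the-envelope sketch covering weights only; it ignores activations, the KV cache, non-quantized layers, and quantization metadata (scales and zero points), which is why realized savings sit nearer 70% than the ideal 75%.

```python
params = 7e9                          # e.g. a 7B-parameter model
fp16_gb = params * 2 / 1e9            # 2 bytes per weight in fp16 -> ~14 GB
int4_gb = params * 0.5 / 1e9          # 4 bits per weight          -> ~3.5 GB
print(f"{fp16_gb:.1f} GB -> {int4_gb:.1f} GB ({1 - int4_gb / fp16_gb:.0%} smaller)")
```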
arXiv Detail & Related papers (2025-03-10T09:26:08Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
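A minimal sketch of the straight-through-estimator idea underlying this kind of quantization-aware training is shown below: the forward pass sees quantized weights, while gradients flow to the latent full-precision weights as if quantization were the identity. The rotation step that RoSTE adds on top is omitted, and the bit width and per-tensor scaling are assumptions.

```python
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, bits=4):
        # Symmetric uniform quantization of the whole tensor to `bits` bits.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient unchanged to the latent weights.
        return grad_output, None

class QuantLinear(torch.nn.Linear):
    def forward(self, x):
        return torch.nn.functional.linear(x, STEQuantize.apply(self.weight), self.bias)

layer = QuantLinear(16, 16)
out = layer(torch.randn(2, 16))
out.sum().backward()  # gradients reach layer.weight despite the rounding
```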
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization [18.017182472532415]
ASER is an algorithm consisting of low-rank compensation for quantization error, with LoRA-style matrices constructed by whitening SVD. ASER is capable of quantizing typical outliers to low-bit values, preserving accuracy even in the W4A8 per-channel setup.
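The core low-rank compensation step can be sketched as follows: quantize a weight matrix, take an SVD of the residual W - Q(W), and keep the top-r factors as LoRA-style correction matrices. The whitening with activation statistics that ASER applies is omitted here, and the rank and bit width are illustrative.

```python
import torch

def quantize_4bit(w):
    qmax = 7  # symmetric signed 4-bit range [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # per-output-channel scale
    return torch.clamp(torch.round(w / scale), -8, qmax) * scale

def low_rank_compensation(w, rank=16):
    q = quantize_4bit(w)
    err = w - q                        # quantization error to be compensated
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]         # (out, r)
    b = vh[:rank, :]                   # (r, in)
    return q, a, b                     # forward pass would use (q + a @ b) @ x

w = torch.randn(256, 256)
q, a, b = low_rank_compensation(w)
print((w - q).norm().item(), (w - (q + a @ b)).norm().item())  # residual error shrinks
```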
arXiv Detail & Related papers (2024-11-12T12:52:04Z)
- QuAILoRA: Quantization-Aware Initialization for LoRA [46.00375834217641]
QLoRA reduces the memory cost of fine-tuning a large language model (LLM) with LoRA by quantizing the base LLM. However, QLoRA introduces quantization errors that negatively impact model performance after fine-tuning.
arXiv Detail & Related papers (2024-10-09T19:06:37Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions and achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
- Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs [27.38239289662178]
Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs).
We explore the role of calibration sets in PTQ, specifically their effect on hidden activations.
Our analysis reveals a marked contrast in quantization effectiveness across accessible models.
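A minimal sketch of how a calibration set typically feeds PTQ: run a handful of examples through the full-precision model, record per-channel activation ranges with forward hooks, and derive quantization scales from them. The model name and calibration texts below are placeholders, and real PTQ methods (GPTQ, AWQ, SmoothQuant, etc.) consume these statistics in more elaborate ways.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"  # small stand-in for a larger LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

act_max = {}
def make_hook(layer_name):
    def hook(module, inputs, output):
        x = inputs[0].detach()
        cur = x.abs().amax(dim=tuple(range(x.dim() - 1)))  # per-input-channel max
        prev = act_max.get(layer_name)
        act_max[layer_name] = cur if prev is None else torch.maximum(prev, cur)
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

calib_texts = ["calibration example 1", "calibration example 2"]
with torch.no_grad():
    for text in calib_texts:
        model(**tok(text, return_tensors="pt"))
for h in handles:
    h.remove()

scales = {n: v / 127 for n, v in act_max.items()}  # e.g. symmetric int8 activation scales
```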
arXiv Detail & Related papers (2024-05-31T14:24:33Z)
- Quantifying the Capabilities of LLMs across Scale and Precision [12.879551933541345]
This study investigates the effect of model scale and quantization on the performance of instruct models.
We find that larger models show exceptional resilience to precision reduction and can maintain high accuracy even at 4-bit quantization.
arXiv Detail & Related papers (2024-05-06T03:42:34Z)
- Temporal Scaling Law for Large Language Models [57.83580734589091]
We propose the novel concept of a Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss at each token position. We derive a much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic law.
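To make the fitting procedure concrete, here is a heavily hedged sketch that fits a simple hyperbolic decay to test loss measured at several training steps, separately per token position. The functional form and the numbers below are illustrative stand-ins, not the paper's dynamic hyperbolic law.

```python
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(t, a, b, c):
    # Illustrative loss-vs-steps form: decays toward an asymptote c.
    return a / (t + b) + c

steps = np.array([1e3, 5e3, 1e4, 5e4, 1e5])
# Synthetic per-position test losses; later positions start lower and flatten sooner.
losses_by_position = {
    0: np.array([6.1, 4.9, 4.3, 3.6, 3.4]),
    128: np.array([5.2, 4.1, 3.6, 3.1, 2.9]),
}

for pos, losses in losses_by_position.items():
    (a, b, c), _ = curve_fit(hyperbolic, steps, losses, p0=[1e4, 1e3, 3.0], maxfev=10000)
    print(f"position {pos}: loss(t) ~ {a:.1f}/(t + {b:.1f}) + {c:.2f}")
```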
arXiv Detail & Related papers (2024-04-27T05:49:11Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
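An input-adaptive skipping rule can be sketched as follows: wrap a transformer block's feed-forward network and bypass it when the hidden state has barely moved, spending the MLP compute only where it appears needed. The cosine-similarity criterion and threshold here are assumptions for illustration, not FFN-SkipLLM's exact rule.

```python
import torch

class SkippableFFN(torch.nn.Module):
    def __init__(self, ffn, threshold=0.999):
        super().__init__()
        self.ffn = ffn
        self.threshold = threshold

    def forward(self, hidden, block_input):
        # `block_input` is the state entering the block, `hidden` the post-attention state.
        sim = torch.nn.functional.cosine_similarity(
            hidden.flatten(1), block_input.flatten(1), dim=-1
        ).mean()
        if sim > self.threshold:       # state barely changed: skip the MLP entirely
            return hidden
        return hidden + self.ffn(hidden)

ffn = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64))
block = SkippableFFN(ffn)
x = torch.randn(2, 8, 64)
print(block(x, x).shape)  # identical states -> FFN is skipped
```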
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- ApiQ: Finetuning of 2-Bit Quantized Large Language Model [12.328293460903911]
ApiQ is designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs.
It consistently achieves superior finetuning results across various bit-widths.
arXiv Detail & Related papers (2024-02-07T09:36:54Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.