Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
- URL: http://arxiv.org/abs/2309.15531v2
- Date: Sun, 24 Mar 2024 20:18:28 GMT
- Title: Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
- Authors: Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee
- Abstract summary: Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks.
Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers.
We propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel.
- Score: 7.485068491216164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than the conventional per-output-channel (per-OC). Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so similarly grouping the weights in the IC direction can isolate outliers within a group. We also find that activation outliers do not dictate quantization difficulty, and inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs. Code is available at https://github.com/johnheo/adadim-llm
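To make the two grouping directions concrete, here is a minimal sketch (an illustration only, not the authors' released code; the 3-bit width, the group size of 128, and the synthetic outlier column are assumptions) that applies Round-To-Nearest within conventional per-OC groups versus per-IC groups:

```python
import numpy as np

def rtn_quantize(group, n_bits=3):
    """Asymmetric round-to-nearest quantize/dequantize of one group."""
    qmax = 2 ** n_bits - 1
    lo, hi = group.min(), group.max()
    scale = max(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((group - lo) / scale), 0, qmax)
    return q * scale + lo

def quantize_grouped(W, group_size=128, per_ic=False, n_bits=3):
    """Group-wise RTN over a (out_channels, in_channels) weight matrix.

    per_ic=False: groups run along the input-channel axis within each output
    channel (the conventional per-OC scheme). per_ic=True: groups run along
    the output-channel axis within each input channel, so a large-magnitude
    input channel cannot inflate the scales of groups belonging to other
    input channels.
    """
    Wq = W.copy().T if per_ic else W.copy()
    rows, cols = Wq.shape
    assert cols % group_size == 0
    for r in range(rows):
        for c in range(0, cols, group_size):
            Wq[r, c:c + group_size] = rtn_quantize(Wq[r, c:c + group_size], n_bits)
    return Wq.T if per_ic else Wq

# Toy demonstration: one input channel with 50x larger weights stands in for
# the effect of an activation outlier on the input dimension of the weights.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, (256, 256))
W[:, 7] *= 50.0
for per_ic in (False, True):
    mse = np.mean((W - quantize_grouped(W, per_ic=per_ic)) ** 2)
    print(f"per_ic={per_ic}: reconstruction MSE = {mse:.3e}")
```

On this toy matrix, per-OC grouping lets every group that straddles the outlier input channel inherit its inflated scale, while per-IC grouping confines the large dynamic range to that channel's own groups, leaving the reconstruction error of all other weights untouched.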
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels. (A minimal sketch of the Hadamard-rotation idea appears after this list.)
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- Channel-Wise Mixed-Precision Quantization for Large Language Models [47.00361921910259]
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks.
Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs.
We introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method.
arXiv Detail & Related papers (2024-10-16T21:34:41Z)
- OutlierTune: Efficient Channel-Wise Quantization for Large Language Models [24.645237670811476]
OutlierTune is an efficient per-channel post-training quantization method for the activations of large language models.
The proposed framework is easy to implement and hardware-efficient, introducing almost no additional computational overhead during inference.
arXiv Detail & Related papers (2024-06-27T02:02:26Z)
- DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs [40.48697728884967]
Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations.
Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes.
We introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate both massive and normal outliers.
arXiv Detail & Related papers (2024-06-03T18:27:44Z)
- Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty of input quantization to its weights.
arXiv Detail & Related papers (2024-04-04T17:25:30Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
This simple and effective approach makes low-bit post-training quantization more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce OmniQuant, an omnidirectionally calibrated quantization technique for LLMs that achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
arXiv Detail & Related papers (2022-09-27T12:05:59Z)
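The HIGGS and DuQuant entries above both lean on the same mechanism: an orthogonal rotation spreads outlier coordinates across many dimensions before quantization, and since the rotation is exactly invertible it can be undone afterwards. The sketch below illustrates this with a Sylvester-construction Hadamard matrix (a toy under stated assumptions: plain per-row RTN stands in for HIGGS's MSE-optimal grids, and the shapes, bit width, and outlier column are made up, not taken from either paper):

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # H @ H.T == I, so the rotation is exactly invertible

def rtn(x, n_bits=4):
    """Per-row asymmetric round-to-nearest quantize/dequantize."""
    qmax = 2 ** n_bits - 1
    lo = x.min(axis=1, keepdims=True)
    scale = np.maximum(x.max(axis=1, keepdims=True) - lo, 1e-8) / qmax
    return np.clip(np.round((x - lo) / scale), 0, qmax) * scale + lo

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, (256, 256))
W[:, 7] *= 50.0                      # synthetic outlier input channel
H = hadamard(W.shape[1])
plain = rtn(W)                       # quantize directly
rotated = rtn(W @ H) @ H.T           # quantize in the rotated basis, rotate back
print(f"plain   MSE: {np.mean((W - plain) ** 2):.3e}")
print(f"rotated MSE: {np.mean((W - rotated) ** 2):.3e}")
```

Because the rotation mixes the outlier channel into every coordinate, the per-row dynamic range shrinks and the reconstruction error drops, even though nothing about the quantizer itself changed.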
This list is automatically generated from the titles and abstracts of the papers on this site.