RPTQ: Reorder-based Post-training Quantization for Large Language Models
- URL: http://arxiv.org/abs/2304.01089v4
- Date: Wed, 17 May 2023 10:07:33 GMT
- Title: RPTQ: Reorder-based Post-training Quantization for Large Language Models
- Authors: Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu
- Abstract summary: Large-scale language models (LLMs) have demonstrated impressive performance, but their deployment presents challenges due to their significant memory usage.
We introduce a quantization method called RPTQ, which utilizes a reorder-based approach.
In our experiments, RPTQ achieved a significant breakthrough by utilizing 3-bit activation in LLMs for the first time, resulting in a substantial reduction in memory usage.
- Score: 46.03754730678076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language models (LLMs) have demonstrated impressive performance,
but their deployment presents challenges due to their significant memory usage.
This issue can be alleviated through quantization. In this paper, we identify
that the challenge in quantizing activations in LLMs arises from varying ranges
across channels, rather than solely the presence of outliers. To address this
challenge, we introduce a quantization method called RPTQ, which utilizes a
reorder-based approach. By rearranging the channels and quantizing them in
clusters, RPTQ effectively mitigates the impact of range differences between
channels. To minimize the overhead of the reorder operation, we fuse it into
the layer norm operation and weights in linear layers. In our experiments, RPTQ
achieved a significant breakthrough by utilizing 3-bit activation in LLMs for
the first time, resulting in a substantial reduction in memory usage. For
instance, quantizing OPT-175b can lead to a memory consumption reduction of up
to 80%.
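
The reorder-and-cluster idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `fake_quantize` and `reorder_and_quantize`, the synthetic data, and the bit-width are all hypothetical, and because the abstract does not say how channels are clustered, the sketch simply sorts channels by range width and splits them into equal-sized groups. The last lines show why the reorder can be absorbed into the weights of the following linear layer: permuting the activation channels and the matching weight rows leaves the product unchanged.

```python
import numpy as np


def fake_quantize(x, n_bits, lo, hi):
    """Uniform asymmetric quantize-dequantize of x over the range [lo, hi]."""
    levels = 2 ** n_bits - 1
    scale = max(hi - lo, 1e-8) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo


def reorder_and_quantize(x, n_bits, n_clusters):
    """Quantize activations x (tokens x channels) one channel cluster at a time.

    Assumption: clusters come from sorting channels by range width and
    splitting into equal-sized groups, a stand-in for the paper's clustering.
    Returns the quantized activations in reordered channel order, plus the
    permutation that was applied.
    """
    ch_min, ch_max = x.min(axis=0), x.max(axis=0)
    order = np.argsort(ch_max - ch_min)   # channel permutation
    x_perm = x[:, order]                  # channels with similar ranges are adjacent
    out = np.empty_like(x_perm)
    for cols in np.array_split(np.arange(x.shape[1]), n_clusters):
        block = x_perm[:, cols]
        # Each cluster gets its own quantization range.
        out[:, cols] = fake_quantize(block, n_bits, block.min(), block.max())
    return out, order


# Synthetic activations whose channel ranges span three orders of magnitude.
rng = np.random.default_rng(0)
channel_scale = np.logspace(-1, 2, 64)
x = rng.normal(size=(256, 64)) * channel_scale
w = rng.normal(size=(64, 16))             # weights of the following linear layer

x_tensor = fake_quantize(x, 4, x.min(), x.max())   # one range for the whole tensor
x_cluster, order = reorder_and_quantize(x, 4, 8)   # one range per cluster

mse_tensor = np.mean((x - x_tensor) ** 2)
mse_cluster = np.mean((x[:, order] - x_cluster) ** 2)
print(f"per-tensor MSE: {mse_tensor:.4f}, clustered MSE: {mse_cluster:.4f}")

# Fusing the reorder: permuting the weight rows with the same permutation
# gives the same layer output, so no runtime gather is needed.
fused_matches = np.allclose(x[:, order] @ w[order, :], x @ w)
print("reorder absorbed by weights:", fused_matches)
```

Because each cluster has its own range, channels with small values are no longer forced onto a grid sized for the largest channel, which is the range-difference effect the abstract describes.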
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- ApiQ: Finetuning of 2-Bit Quantized Large Language Model [12.328293460903911]
ApiQ is designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs.
It consistently achieves superior finetuning results across various bit-widths.
arXiv Detail & Related papers (2024-02-07T09:36:54Z)
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models [44.515165695546614]
Quantization-Aware Training (QAT) offers a solution, but its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for Large Language Models (LLMs).
We propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs.
arXiv Detail & Related papers (2023-10-12T05:25:49Z)
- Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models [7.485068491216164]
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks.
Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers.
We propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel.
arXiv Detail & Related papers (2023-09-27T09:48:31Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective [74.48124653728422]
Post-training quantization (PTQ) is widely regarded as one of the most efficient compression methods in practice.
We argue that oscillation is an overlooked problem in PTQ methods.
arXiv Detail & Related papers (2023-03-21T14:52:52Z)