Achieving binary weight and activation for LLMs using Post-Training Quantization
- URL: http://arxiv.org/abs/2504.05352v1
- Date: Mon, 07 Apr 2025 04:50:04 GMT
- Title: Achieving binary weight and activation for LLMs using Post-Training Quantization
- Authors: Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xuyao Zhang,
- Abstract summary: Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs. Existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). We propose a post-training quantization framework with a W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels.
- Score: 32.247768389896144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with a W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we equivalently decompose INT4-quantized activations into a 4×INT1 format and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models.
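The A(1*4) half of the recipe follows from a lossless identity: an unsigned INT4 code equals the weighted sum of its four bits, so an INT4 activation matmul can be rewritten exactly as a binary matmul over four times as many channels, with the bit weights 2^k folded into replicated weight rows. The NumPy sketch below checks that identity and adds one plausible reading of the W(1+1) format (a sign bit plus a second bit selecting one of two per-channel scales). The function names and the two-scale reading are ours, not the paper's, and the Hessian-aware grouping, EM-based scale fitting, and error-aware smoothing are not reproduced here.

```python
import numpy as np

def decompose_int4_to_binary(a_q: np.ndarray) -> np.ndarray:
    """Split unsigned INT4 codes (0..15) into four INT1 bit-planes stacked
    along the channel axis: (n, d) -> (n, 4*d)."""
    planes = [(a_q >> k) & 1 for k in range(4)]
    return np.concatenate(planes, axis=1).astype(np.float32)

def expand_weight_for_bitplanes(w: np.ndarray) -> np.ndarray:
    """Replicate the weight four times along the input axis, folding the
    bit-plane magnitudes 2^k into each copy: (d, m) -> (4*d, m)."""
    return np.concatenate([(2.0 ** k) * w for k in range(4)], axis=0)

def binarize_weights_two_scales(w, group_bit, s0, s1):
    """Hypothetical reading of W(1+1): keep the sign of each weight (1 bit)
    and let the extra bit pick one of two per-output-channel scales.
    The paper chooses the grouping and scales via Hessian information and an
    EM scheme; here they are simply assumed to be given."""
    scale = np.where(group_bit, s1[:, None], s0[:, None])
    return np.sign(w) * scale

# Equivalence check for the A(1*4) decomposition: the binary matmul over
# 4x channels reproduces the INT4 matmul exactly.
rng = np.random.default_rng(0)
a_q = rng.integers(0, 16, size=(2, 8))                  # stand-in INT4 activation codes
w = rng.standard_normal((8, 3)).astype(np.float32)

ref = a_q.astype(np.float32) @ w
out = decompose_int4_to_binary(a_q) @ expand_weight_for_bitplanes(w)
assert np.allclose(ref, out)
```

The check only covers the lossless 4×INT1 rewrite; the per-channel scaling factors mentioned in the abstract would additionally be applied and smoothed against the measured quantization error.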
Related papers
- MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration [23.752021919501207]
We propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method. On the Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding and up to 2.06x end-to-end speedup over the FP16 baseline.
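For context on the setting, the sketch below shows generic per-channel static quantization: per-channel scales are calibrated once offline and then reused at inference, avoiding per-batch statistics. It illustrates the setting only and is not MergeQuant's QSM procedure; all names are illustrative.

```python
import numpy as np

def calibrate_per_channel_scales(calib_acts, n_bits=4):
    """Offline calibration: one symmetric scale per channel, taken from the
    channel-wise abs-max over a calibration set of shape (N, d)."""
    qmax = 2 ** (n_bits - 1) - 1
    absmax = np.abs(calib_acts).max(axis=0)             # (d,)
    return np.maximum(absmax, 1e-8) / qmax

def static_quantize(x, scales, n_bits=4):
    """Runtime: reuse the calibrated scales (no dynamic range estimation)."""
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(x / scales), -qmax - 1, qmax)
    return q, q * scales                                # integer codes, dequantized values

# Hypothetical usage
calib = np.random.randn(512, 16).astype(np.float32)    # calibration activations
scales = calibrate_per_channel_scales(calib)
x = np.random.randn(4, 16).astype(np.float32)
q, x_hat = static_quantize(x, scales)
```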
arXiv Detail & Related papers (2025-03-07T04:52:28Z)
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables 1.61-bit weight quantization for the first time. Experiments indicate that PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression [6.859010157930106]
Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs).
The "quantization kernel" refers to the set of elements in activations that are quantized to zero.
We propose CrossQuant: a simple yet effective method for quantizing activations.
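As a concrete illustration of that definition (ours, not CrossQuant's method), the snippet below measures the quantization-kernel ratio of a tensor under plain symmetric per-tensor absmax quantization, showing how a single outlier inflates the scale and collapses many small entries to the zero code.

```python
import numpy as np

def quantization_kernel_ratio(x, n_bits=4):
    """Fraction of activation entries mapped to the zero code by symmetric
    per-tensor absmax quantization (an illustration of the 'quantization
    kernel' notion, not CrossQuant itself)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale)
    return float(np.mean(q == 0))

x = (0.1 * np.random.randn(1024, 768)).astype(np.float32)
x[0, 0] = 8.0                                           # one outlier inflates the scale ...
print(quantization_kernel_ratio(x))                     # ... so most small entries round to zero
```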
arXiv Detail & Related papers (2024-10-10T00:44:24Z)
- MobileQuant: Mobile-friendly Quantization for On-device Language Models [31.75012542498791]
Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications.
However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs.
We introduce a simple post-training quantization method, named MobileQuant, that extends previous work on weight equivalent transformations.
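The weight equivalent transformation idea rests on a simple identity: dividing activations by per-channel factors and multiplying the corresponding weight rows by the same factors leaves the layer output unchanged, so dynamic range can be migrated from activations into weights before quantization. Below is a minimal generic sketch of that identity (not MobileQuant's actual end-to-end optimization; names are ours).

```python
import numpy as np

def migrate_scales(x, w, alpha=0.5):
    """Equivalent transformation: per-input-channel factors s move dynamic
    range from activations into weights, since (x / s) @ (s[:, None] * w)
    equals x @ w exactly. alpha trades off how much range is migrated."""
    act_max = np.abs(x).max(axis=0)                     # (d,)
    w_max = np.abs(w).max(axis=1)                       # (d,)
    s = np.maximum((act_max ** alpha) / np.maximum(w_max ** (1 - alpha), 1e-8), 1e-8)
    return x / s, w * s[:, None]

x = np.random.randn(32, 16).astype(np.float32)
w = np.random.randn(16, 8).astype(np.float32)
x_t, w_t = migrate_scales(x, w)
assert np.allclose(x @ w, x_t @ w_t, atol=1e-4)         # the layer output is unchanged
```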
arXiv Detail & Related papers (2024-08-25T20:41:22Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) at ultra-low cost. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency via a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.