AFPQ: Asymmetric Floating Point Quantization for LLMs
- URL: http://arxiv.org/abs/2311.01792v1
- Date: Fri, 3 Nov 2023 09:07:09 GMT
- Title: AFPQ: Asymmetric Floating Point Quantization for LLMs
- Authors: Yijia Zhang, Sicheng Zhang, Shijie Cao, Dayou Du, Jianyu Wei, Ting
Cao, Ningyi Xu
- Abstract summary: We propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values.
Our method leads to large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ.
No additional storage is needed compared with asymmetric integer (INT) quantization.
- Score: 6.176074875528637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) show great performance in various tasks, but
face deployment challenges from limited memory capacity and bandwidth. Low-bit
weight quantization can save memory and accelerate inference. Although
floating-point (FP) formats show good performance in LLM quantization, they
tend to perform poorly with small group sizes or sub-4 bits. We find the reason
is that the absence of asymmetry in previous FP quantization makes it
unsuitable for handling the asymmetric value distributions of LLM weight tensors. In
this work, we propose asymmetric FP quantization (AFPQ), which sets separate
scales for positive and negative values. Our method leads to large accuracy
improvements and can be easily plugged into other quantization methods,
including GPTQ and AWQ, for better performance. Besides, no additional storage
is needed compared with asymmetric integer (INT) quantization. The code is
available at https://github.com/zhangsichengsjtu/AFPQ.
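
Since the core mechanism is compact (one scale for the positive values of a weight group and a second scale for the negative values), the NumPy sketch below illustrates it on a single group. It is an illustrative sketch, not the released implementation: the FP4 (E2M1) magnitude grid, the group size, and the helper names (round_to_grid, afpq_quantize_group) are assumptions made for this example; the official code lives at the repository linked above.

```python
import numpy as np

# Positive magnitudes representable by a 4-bit float (E2M1); the exact grid
# is an assumption made for this illustration.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(x, grid):
    """Round each value to the nearest representable FP magnitude."""
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def afpq_quantize_group(w, grid=FP4_GRID):
    """Asymmetric FP quantization of one weight group (sketch).

    Instead of a single scale, use separate scales for the positive and the
    negative weights, so both tails of a skewed distribution can reach the
    full FP range. Only two scalars are stored per group, matching the
    storage of asymmetric INT quantization (scale plus zero point).
    """
    q_max = grid.max()
    scale_pos = max(float(w[w > 0].max()), 1e-8) / q_max if (w > 0).any() else 1.0
    scale_neg = max(float(-w[w < 0].min()), 1e-8) / q_max if (w < 0).any() else 1.0

    w_q = np.where(
        w >= 0,
        round_to_grid(w / scale_pos, grid) * scale_pos,
        -round_to_grid(-w / scale_neg, grid) * scale_neg,
    )
    return w_q, scale_pos, scale_neg

# A skewed weight group: the two scales track each tail independently.
rng = np.random.default_rng(0)
w = rng.normal(loc=0.02, scale=0.05, size=128).astype(np.float32)
w_q, s_pos, s_neg = afpq_quantize_group(w)
print("reconstruction MSE:", float(np.mean((w - w_q) ** 2)))
```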
Related papers
- IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
We propose IntLoRA, to push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models.
IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting.
arXiv Detail & Related papers (2024-10-29T05:50:17Z)
- Pyramid Vector Quantization for LLMs [8.779688608449902]
We introduce Pyramid Vector Quantization (PVQ) for large language models.
PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory.
We achieve state-of-the-art quantization performance, with a Pareto-optimal trade-off between performance and bits per weight and per activation, compared to existing methods.
arXiv Detail & Related papers (2024-10-22T11:57:32Z)
- Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in LLaMA-13B to only 4 bits and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models [7.485068491216164]
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks.
Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers.
We propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel.
arXiv Detail & Related papers (2023-09-27T09:48:31Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, at a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [54.692405042065815]
We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization.
AWQ protects only 1% of the salient weights and achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs.
We also implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs.
arXiv Detail & Related papers (2023-06-01T17:59:10Z)
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models [17.055400141733124]
Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for quantizing large language models (LLMs).
Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU.
We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
arXiv Detail & Related papers (2023-05-21T05:28:37Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs, and propose module-wise reconstruction error minimization (MREM), an efficient PTQ solution (a generic sketch of the module-wise idea follows below).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
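
For the module-wise objective mentioned in the last entry, the general recipe is to quantize one module at a time and tune it so that its outputs match those of its full-precision counterpart on calibration data. The PyTorch snippet below is only a hedged illustration of that idea, not the paper's MREM implementation; the FakeQuantLinear wrapper, the 4-bit uniform quantizer, and the simple training loop are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinear(nn.Module):
    """Linear layer whose weight is fake-quantized on the fly (illustrative)."""
    def __init__(self, linear: nn.Linear, n_bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = None if linear.bias is None else nn.Parameter(linear.bias.detach().clone())
        self.q_max = 2 ** (n_bits - 1) - 1

    def forward(self, x):
        scale = self.weight.abs().max().clamp(min=1e-8) / self.q_max
        w_q = (self.weight / scale).round().clamp(-self.q_max, self.q_max) * scale
        # Straight-through estimator: forward uses w_q, backward treats it as identity.
        w_ste = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w_ste, self.bias)

def minimize_module_error(fp_module, q_module, calib_batches, steps=200, lr=1e-5):
    """Tune a quantized module to reproduce its full-precision outputs."""
    opt = torch.optim.Adam(q_module.parameters(), lr=lr)
    for step in range(steps):
        x = calib_batches[step % len(calib_batches)]
        with torch.no_grad():
            target = fp_module(x)  # teacher output from the full-precision module
        loss = F.mse_loss(q_module(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_module

# Toy usage on a single linear "module"; a real PTQ pass would iterate over
# transformer blocks and feed each block cached calibration inputs.
fp = nn.Linear(64, 64)
q = FakeQuantLinear(fp)
calib = [torch.randn(8, 64) for _ in range(4)]
minimize_module_error(fp, q, calib)
```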