OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and
Inference of Large Language Models
- URL: http://arxiv.org/abs/2306.02272v4
- Date: Wed, 24 Jan 2024 02:53:27 GMT
- Title: OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and
Inference of Large Language Models
- Authors: Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park
- Abstract summary: outlier-aware weight quantization (OWQ) method minimizes large language models' footprint through low-precision representation.
OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high-precision, while applying highly tuned quantization to the remaining dense weights.
Experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ.
- Score: 15.461748851931588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) with hundreds of billions of parameters require
powerful server-grade GPUs for inference, limiting their practical deployment.
To address this challenge, we introduce the outlier-aware weight quantization
(OWQ) method, which aims to minimize LLMs' footprint through low-precision
representation. OWQ prioritizes a small subset of structured weights sensitive
to quantization, storing them in high precision, while applying highly tuned
quantization to the remaining dense weights. This sensitivity-aware
mixed-precision scheme reduces the quantization error notably, and extensive
experiments demonstrate that 3.1-bit models using OWQ perform comparably to
4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a
parameter-efficient fine-tuning for task-specific adaptation, called weak
column tuning (WCT), enabling accurate task-specific LLM adaptation with
minimal memory overhead in the optimized format. OWQ represents a notable
advancement in the flexibility, efficiency, and practicality of LLM
optimization literature. The source code is available at
https://github.com/xvyaward/owq
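Below is a minimal PyTorch sketch of the idea described in the abstract, not the reference implementation from the repository above: columns of a weight matrix are scored with a simple sensitivity proxy, the top-k "weak" columns are kept in FP16 (and left trainable, in the spirit of weak column tuning), and the remaining columns are quantized per-column with plain round-to-nearest. The function names, the choice of k, the 3-bit setting, and the use of mean squared activations as the sensitivity proxy are illustrative assumptions.

```python
import torch

def select_weak_columns(weight: torch.Tensor, act_sq_mean: torch.Tensor, k: int) -> torch.Tensor:
    # Score each input column by an illustrative sensitivity proxy:
    # column weight energy scaled by the mean squared activation of that input feature.
    # weight: (out_features, in_features), act_sq_mean: (in_features,)
    scores = weight.pow(2).sum(dim=0) * act_sq_mean
    return torch.topk(scores, k).indices

def quantize_columns_rtn(weight: torch.Tensor, bits: int = 3) -> torch.Tensor:
    # Per-column symmetric round-to-nearest quantization; a stand-in for the
    # tuned quantizer described in the paper. Returns a dequantized dense copy.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale

def owq_style_split(weight: torch.Tensor, act_sq_mean: torch.Tensor,
                    k: int = 8, bits: int = 3):
    # Keep the k most sensitive ("weak") columns in FP16 as a separate trainable
    # parameter; quantize everything else and treat it as frozen.
    weak_idx = select_weak_columns(weight, act_sq_mean, k)
    w_q = quantize_columns_rtn(weight, bits=bits)
    w_q[:, weak_idx] = 0.0                          # weak columns bypass quantization
    weak_cols = torch.nn.Parameter(weight[:, weak_idx].clone().half())
    return w_q, weak_idx, weak_cols

# Toy usage: a 16x64 layer keeping 8 of 64 columns in high precision.
torch.manual_seed(0)
W = torch.randn(16, 64)
act_stats = torch.rand(64)                          # e.g. running mean of x**2 per input feature
W_q, idx, weak = owq_style_split(W, act_stats, k=8, bits=3)
x = torch.randn(2, 64)
y = x @ W_q.T + x[:, idx] @ weak.float().T          # frozen quantized path + trainable weak columns
```

In this toy split only `weak` would receive gradients during task-specific fine-tuning, which mirrors the minimal-memory-overhead adaptation claim; the paper's actual column selection is sensitivity-based rather than the heuristic used here.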
Related papers
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
In the zero-shot task, GWQ quantized models have higher accuracy compared to other quantization methods.
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
- SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models [2.867517731896504]
SQFT is an end-to-end solution for low-precision sparse parameter-efficient fine-tuning of large pre-trained models.
SQFT allows for effective model manipulation in resource-constrained environments.
SQFT also addresses the challenge of having quantized weights and adapters with different numerical precisions.
arXiv Detail & Related papers (2024-10-01T19:49:35Z)
- Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance [20.659750151408186]
Large Language Models (LLMs) have demonstrated impressive performance across various domains.
Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA).
We propose Quantized LLMs with Balanced-rank Adaptation (Q-BaRA) and Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA).
arXiv Detail & Related papers (2024-07-24T06:16:37Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4 bits.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)