INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error
Correction through Low-Rank Adaptation
- URL: http://arxiv.org/abs/2306.08162v1
- Date: Tue, 13 Jun 2023 22:25:35 GMT
- Title: INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error
Correction through Low-Rank Adaptation
- Authors: Yuji Chai, John Gkountouras, Glenn G. Ko, David Brooks, Gu-Yeon Wei
- Abstract summary: We introduce a method that dramatically reduces fine-tuning VRAM requirements and rectifies quantization errors in quantized Large Language Models.
Our method reduces the memory requirements by up to 5.6 times, which enables fine-tuning a 7 billion parameter Large Language Model (LLM) on consumer laptops.
- Score: 5.837035655563323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a method that dramatically reduces fine-tuning VRAM requirements
and rectifies quantization errors in quantized Large Language Models. First, we
develop an extremely memory-efficient fine-tuning (EMEF) method for quantized
models using Low-Rank Adaptation (LoRA), and drawing upon it, we construct an
error-correcting algorithm designed to minimize errors induced by the
quantization process. Our method reduces the memory requirements by up to 5.6
times, which enables fine-tuning a 7 billion parameter Large Language Model
(LLM) on consumer laptops. At the same time, we propose a Low-Rank Error
Correction (LREC) method that exploits the added LoRA layers to close the
gap between the quantized model and its floating-point counterpart. Our error
correction framework leads to a fully functional INT2 quantized LLM with the
capacity to generate coherent English text. To the best of our knowledge, this
is the first INT2 Large Language Model able to reach this level of
performance. The overhead of our method is merely a 1.05 times increase in
model size, which translates to an effective precision of INT2.1. Also, our
method readily generalizes to other quantization standards, such as INT3, INT4,
and INT8, restoring their lost performance, which marks a significant milestone
in the field of model quantization. The strategies delineated in this paper
hold promising implications for the future development and optimization of
quantized models, marking a pivotal shift in the landscape of low-resource
machine learning computations.
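Taken together, the abstract describes two coupled ideas: (i) EMEF, which freezes an INT-quantized base model and trains only low-rank LoRA adapters, and (ii) LREC, which reuses those same adapters to absorb the quantization error so the quantized model tracks its floating-point counterpart. The sketch below illustrates that structure for a single linear layer; the symmetric per-channel quantizer, the rank, the calibration objective, and every name in it are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the EMEF/LREC structure for one linear layer (bias omitted).
import torch
import torch.nn as nn

def quantize_sym(w: torch.Tensor, bits: int = 2):
    """Symmetric per-output-channel uniform quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                                   # INT2 -> qmax = 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)     # 2**bits signed levels
    return q.to(torch.int8), scale

class QuantLinearWithLoRA(nn.Module):
    """Frozen quantized weight plus a trainable low-rank correction B @ A."""
    def __init__(self, fp_linear: nn.Linear, bits: int = 2, rank: int = 16):
        super().__init__()
        q, scale = quantize_sym(fp_linear.weight.data, bits)
        self.register_buffer("q_weight", q)      # frozen INT codes, never trained
        self.register_buffer("scale", scale)     # per-channel dequantization scales
        out_f, in_f = fp_linear.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))  # B = 0: starts as the plain quantized layer

    def forward(self, x):
        w = self.q_weight.float() * self.scale + self.lora_B @ self.lora_A
        return x @ w.T

def correct_layer(fp_linear, q_layer, calib_x, steps=200, lr=1e-3):
    """LREC-style objective as we read it: fit the adapters so the quantized
    layer reproduces the output of its floating-point counterpart."""
    opt = torch.optim.Adam([q_layer.lora_A, q_layer.lora_B], lr=lr)
    target = calib_x @ fp_linear.weight.data.T    # floating-point reference output
    for _ in range(steps):
        loss = torch.mean((q_layer(calib_x) - target) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```

Because the frozen base stays at 2 bits and only the small adapters are kept in higher precision, the reported 1.05x size overhead works out to roughly 2 x 1.05 ≈ 2.1 effective bits per parameter, which is where the name INT2.1 comes from.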
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels. (A rough sketch of the rotate-then-quantize idea appears after this list.)
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs [2.7624021966289605]
Large Language Models (LLMs) have revolutionized natural language understanding and generation tasks.
LLMs suffer from high memory consumption and slow inference times due to their large parameter sizes.
This paper introduces SLiM, a novel approach for compressing LLMs using a one-shot Quantized Sparse Plus Low-rank Approximation. (A generic sparse-plus-low-rank sketch appears after this list.)
arXiv Detail & Related papers (2024-10-12T18:36:07Z)
- Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework for large vision-language models (LVLMs) to enable efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy.
Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z)
- TernaryLLM: Ternarized Large Language Model [29.29122031050894]
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks.
We introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. (See the ternarization sketch after this list.)
We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization.
arXiv Detail & Related papers (2024-06-11T11:40:12Z)
- Characterizing the Accuracy-Efficiency Trade-off of Low-rank Decomposition in Language Models [1.401463252785724]
Low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale.
We formalize the low-rank decomposition design space and show that it is enormous.
Our results show that we can achieve a 9% model size reduction with minimal accuracy drops. (A back-of-the-envelope rank-versus-size check appears after this list.)
arXiv Detail & Related papers (2024-05-10T17:40:02Z)
- Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models [88.80146574509195]
Quantization is a promising approach for reducing memory overhead and accelerating inference.
We propose a novel zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs.
arXiv Detail & Related papers (2023-10-20T07:09:56Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. (A generic group-wise quantization sketch appears after this list.)
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
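For the Linearity Theorem / HIGGS entry above, the following is a hedged sketch of the general rotate-then-quantize idea only: a randomized Hadamard rotation spreads outliers so the rotated weights look roughly Gaussian, and a small MSE-minimizing grid is then fit to them. The Lloyd-Max-style grid fit, the level count, and the function names are stand-ins, not the paper's method.

```python
# Illustrative rotate-then-quantize sketch (not HIGGS itself).
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)                          # orthonormal rotation

def mse_grid(x: np.ndarray, levels: int = 4, iters: int = 20) -> np.ndarray:
    """1-D Lloyd-Max grid (k-means on scalars) minimizing squared error."""
    grid = np.quantile(x, np.linspace(0.1, 0.9, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                grid[k] = x[idx == k].mean()
    return grid

def rotate_and_quantize(W: np.ndarray, levels: int = 4):
    n = W.shape[1]                                 # assume n is a power of two
    R = hadamard(n) * np.sign(np.random.randn(n))  # randomized Hadamard rotation
    W_rot = W @ R
    grid = mse_grid(W_rot.ravel(), levels)         # fit the grid on rotated weights
    idx = np.abs(W_rot.ravel()[:, None] - grid[None, :]).argmin(axis=1)
    W_hat_rot = grid[idx].reshape(W_rot.shape)
    return W_hat_rot @ R.T, grid                   # rotate back; a real kernel would store idx + grid
```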
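For the SLiM entry above, a generic one-shot sparse-plus-low-rank approximation of a weight matrix can be sketched as follows; the magnitude-pruning mask, the uniform quantizer, and the SVD fit of the residual are generic stand-ins rather than the paper's actual procedure.

```python
# Generic one-shot sparse + low-rank approximation:  W ≈ quantize(mask * W) + L
import numpy as np

def sparse_plus_low_rank(W: np.ndarray, keep: float = 0.5, rank: int = 16, bits: int = 4):
    # 1) Magnitude pruning: keep only the largest `keep` fraction of weights.
    thresh = np.quantile(np.abs(W), 1.0 - keep)
    S = np.where(np.abs(W) >= thresh, W, 0.0)

    # 2) Uniform symmetric quantization of the surviving weights
    #    (dequantized here for clarity; stored as INT codes plus a scale in practice).
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(S).max(), 1e-8) / qmax
    S_q = np.round(S / scale).clip(-qmax, qmax) * scale

    # 3) Low-rank correction of whatever the sparse quantized part missed.
    U, s, Vt = np.linalg.svd(W - S_q, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return S_q, L                                  # use S_q + L in place of W
```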
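For the TernaryLLM entry above, a minimal learnable ternarization layer in the spirit of DLT can be written with a straight-through estimator; the per-row parameterization, the threshold, and the names are assumptions, not the paper's code.

```python
# Minimal learnable ternarization (illustrative): w ≈ alpha * t + beta, t in {-1, 0, +1},
# where the scale alpha and the shift beta are trainable, as the DLT summary describes.
import torch
import torch.nn as nn

class LearnableTernaryLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, threshold: float = 0.05):
        super().__init__()
        self.weight = nn.Parameter(weight.clone())
        self.alpha = nn.Parameter(weight.abs().mean(dim=1, keepdim=True))  # learnable scale
        self.beta = nn.Parameter(torch.zeros(weight.shape[0], 1))          # learnable shift
        self.threshold = threshold

    def forward(self, x):
        w = self.weight
        w_n = w / (w.abs().max() + 1e-8)                # smooth proxy for the estimator
        t = torch.sign(w) * (w.abs() > self.threshold * w.abs().max()).float()
        t_ste = (t - w_n).detach() + w_n                # straight-through estimator
        w_q = self.alpha * t_ste + self.beta            # ternary values with learnable scale/shift
        return x @ w_q.T
```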
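For the low-rank decomposition entry above, the size side of the trade-off is simple arithmetic: factorizing an m x n matrix into rank-r factors stores r*(m + n) parameters instead of m*n, so it only saves memory while r < m*n / (m + n). The helper below is a back-of-the-envelope check, not anything from the paper.

```python
# When does a rank-r factorization W ≈ A @ B (A: m x r, B: r x n) shrink a layer?
def low_rank_savings(m: int, n: int, r: int) -> float:
    """Fraction of parameters saved by factorizing an m x n matrix at rank r."""
    return 1.0 - r * (m + n) / (m * n)

# Example: for a 4096 x 4096 projection the break-even rank is 2048;
# at r = 1024 the layer shrinks by half.
print(low_rank_savings(4096, 4096, 1024))   # 0.5
```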
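For the FineQuant entry above, "fine-grained weight-only quantization" generally means giving small groups of weights their own scale so a single outlier cannot blow up the error for a whole row. The group size, layout, and packing below are assumptions, not FineQuant's actual scheme.

```python
# Generic group-wise weight-only quantization (illustrative).
import numpy as np

def quantize_groupwise(W: np.ndarray, bits: int = 4, group: int = 128):
    out_f, in_f = W.shape
    assert in_f % group == 0, "pad the layer so rows divide evenly into groups"
    qmax = 2 ** (bits - 1) - 1
    Wg = W.reshape(out_f, in_f // group, group)
    scales = np.maximum(np.abs(Wg).max(axis=2, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(Wg / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales                                  # one scale per group of weights

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    out_f, n_groups, group = q.shape
    return (q.astype(np.float32) * scales).reshape(out_f, n_groups * group)
```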
This list is automatically generated from the titles and abstracts of the papers on this site.