ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with
Modular Quantizers
- URL: http://arxiv.org/abs/2309.16119v2
- Date: Sun, 10 Mar 2024 03:24:06 GMT
- Title: ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with
Modular Quantizers
- Authors: Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr
Kuleshov
- Abstract summary: We propose a memory-efficient finetuning algorithm for large language models (LLMs).
ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches.
We also surpass the state-of-the-art ROUGE score on a popular summarization task.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a memory-efficient finetuning algorithm for large language models
(LLMs) that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision
on as little as one 24GB GPU. Our method, modular low-rank adaptation
(ModuLoRA), integrates any user-specified weight quantizer with finetuning via
low-rank adapters (LoRAs). Our approach relies on a simple
quantization-agnostic backward pass that adaptively materializes low-precision
LLM weights from a custom black-box quantization module. This approach enables
finetuning 2-bit and 3-bit LLMs for the first time -- leveraging
state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization --
outperforming finetuning that relies on less sophisticated 4-bit and 8-bit
methods. In our experiments, ModuLoRA attains competitive performance on text
classification, natural language inference, and instruction following tasks
using significantly less memory than existing approaches, and we also surpass
the state-of-the-art ROUGE score on a popular summarization task. We release
ModuLoRA together with a series of low-precision models as part of LLMTune, a
user-friendly library for quantizing, running, and finetuning LLMs on consumer
GPUs.
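To make the approach concrete, here is a minimal PyTorch sketch of the mechanism the abstract describes: a frozen base weight that is materialized on demand from a black-box quantization module inside a custom autograd function, combined with a trainable low-rank (LoRA) update. The names ToyQuantizer, QuantizedMatmul, and ModuLoRALinear, and the round-to-nearest stand-in quantizer, are illustrative assumptions rather than the actual ModuLoRA/LLMTune API; in the paper the black-box module would be a state-of-the-art quantizer such as OPTQ or QuIP#.

import torch
import torch.nn as nn


class ToyQuantizer:
    """Stand-in for a user-specified black-box quantizer (e.g., OPTQ or QuIP#).
    Naive symmetric round-to-nearest, kept only to make the sketch runnable."""

    def __init__(self, w, bits=4):
        self.scale = w.abs().max() / (2 ** (bits - 1) - 1)
        self.codes = torch.clamp(torch.round(w / self.scale),
                                 -(2 ** (bits - 1)), 2 ** (bits - 1) - 1).to(torch.int8)

    def dequantize(self):
        # Materialize the low-precision weight matrix on demand.
        return self.codes.float() * self.scale


class QuantizedMatmul(torch.autograd.Function):
    """Quantization-agnostic matmul: the frozen base weight is re-materialized
    from the quantizer in both passes, so no full-precision copy is stored."""

    @staticmethod
    def forward(ctx, x, quantizer):
        ctx.quantizer = quantizer
        return x @ quantizer.dequantize().t()

    @staticmethod
    def backward(ctx, grad_out):
        # Gradient w.r.t. the input only; the quantized weight stays frozen.
        return grad_out @ ctx.quantizer.dequantize(), None


class ModuLoRALinear(nn.Module):
    """Frozen quantized base layer plus a trainable low-rank adapter."""

    def __init__(self, quantizer, in_features, out_features, rank=16, alpha=16.0):
        super().__init__()
        self.quantizer = quantizer
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        base = QuantizedMatmul.apply(x, self.quantizer)   # x @ W_q^T, W_q never cached
        lora = (x @ self.lora_A.t()) @ self.lora_B.t()    # trainable low-rank update
        return base + self.scale * lora


# Usage sketch: only lora_A and lora_B receive gradients.
layer = ModuLoRALinear(ToyQuantizer(torch.randn(256, 128)), in_features=128, out_features=256)
out = layer(torch.randn(4, 128))
out.sum().backward()

Because the dequantized weight is created and freed within each matmul, peak memory in this sketch is dominated by the quantized codes and the low-rank adapters rather than by a full-precision copy of the base model.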
Related papers
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique widely investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.