Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth
- URL: http://arxiv.org/abs/2505.03802v3
- Date: Tue, 10 Jun 2025 09:41:35 GMT
- Title: Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth
- Authors: Changhai Zhou, Shijie Han, Shiyang Zhang, Yuhua Zhou, Weizhong Zhang, Cheng Jin
- Abstract summary: QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLMs). We propose QR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of the low-rank spaces for each layer. Our approach achieves a 4.89% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.
- Score: 10.872650037112255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLMs). Recently proposed methods that use SVD-based iterative updates to initialize LoRA matrices so as to absorb quantization errors have generally failed to improve performance consistently. Dynamic mixed precision is a natural way to keep improving the fine-tuning performance of quantized models, but previous methods often optimize low-rank subspaces or quantization components separately, without considering their synergy. To address this, we propose QR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of the low-rank spaces for each layer, thereby continuously improving model performance. QR-Adaptor does not minimize quantization error but instead treats precision and rank allocation as a discrete optimization problem guided by actual downstream performance and memory usage. Compared to state-of-the-art (SOTA) quantized LoRA fine-tuning methods, our approach achieves a 4.89% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.
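To make the abstract's idea of a gradient-free, layer-wise joint search over bitwidths and LoRA ranks more concrete, here is a minimal Python sketch: it samples per-layer (bitwidth, rank) assignments, discards those that exceed a memory budget, and keeps the configuration that scores best on a calibration subset. The candidate grids, the memory model, and the plain random-search strategy are illustrative assumptions, not the paper's actual QR-Adaptor algorithm.

```python
# Hypothetical sketch of a gradient-free, layer-wise joint search over
# quantization bitwidth and LoRA rank, guided by downstream performance
# and memory usage. Not the authors' implementation.
import random
from typing import Callable, Dict, Tuple

BITWIDTHS = [2, 3, 4, 8]     # hypothetical per-layer precision candidates
RANKS = [4, 8, 16, 32, 64]   # hypothetical per-layer LoRA rank candidates

def memory_cost(config: Dict[str, Tuple[int, int]],
                layer_params: Dict[str, int],
                hidden_size: int) -> float:
    """Rough memory estimate in bytes: quantized base weights plus fp16 LoRA factors."""
    total = 0.0
    for name, (bits, rank) in config.items():
        total += layer_params[name] * bits / 8   # quantized base weights
        total += 2 * hidden_size * rank * 2      # LoRA A and B matrices in fp16
    return total

def search(layer_params: Dict[str, int],
           hidden_size: int,
           evaluate: Callable[[Dict[str, Tuple[int, int]]], float],
           memory_budget: float,
           trials: int = 50) -> Dict[str, Tuple[int, int]]:
    """Sample per-layer (bits, rank) assignments, discard over-budget ones,
    and keep the configuration that scores best on the calibration data."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {name: (random.choice(BITWIDTHS), random.choice(RANKS))
               for name in layer_params}
        if memory_cost(cfg, layer_params, hidden_size) > memory_budget:
            continue
        score = evaluate(cfg)  # e.g. accuracy on a calibration subset
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```

In practice the `evaluate` callback would quantize the model with the sampled bitwidths, attach LoRA adapters of the sampled ranks, and measure downstream accuracy on the calibration split; a stronger search strategy (e.g. evolutionary or Pareto-based) could replace the random sampling.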
Related papers
- Low-rank Momentum Factorization for Memory Efficient Training [13.464518325870444]
Momentum Factorized SGD (MoFaSGD) maintains a dynamically updated low-rank SVD representation of the first-order momentum. We demonstrate MoFaSGD's effectiveness on large language model benchmarks, achieving a competitive trade-off between memory reduction (comparable to LoRA) and performance.
arXiv Detail & Related papers (2025-07-10T18:04:52Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- Fine-tuning Quantized Neural Networks with Zeroth-order Optimization [18.645267970472936]
Quantized Zeroth-order Optimization (QZO) is a novel approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO can reduce the total memory cost by more than 18× for 4-bit LLMs, and enables fine-tuning Llama-2-13B and Stable Diffusion 3.5 Large within a single 24GB GPU.
arXiv Detail & Related papers (2025-05-19T17:55:15Z)
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [65.37942405146232]
We present a novel type of optimizer that carries extremely lightweight state elements, achieved through ultra-low-precision quantization. The proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
- FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs [13.951330786310262]
FineQ is a software-hardware co-design for low-bit fine-grained mixed-precision quantization of large language models. It partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters. It achieves higher model accuracy than the SOTA mixed-precision quantization algorithm at a close average bit-width.
arXiv Detail & Related papers (2025-04-28T12:47:23Z)
- GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models [2.1388885579612804]
GANQ is a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization.
arXiv Detail & Related papers (2025-01-22T15:29:09Z)
- QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models [3.093903491123962]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation. We introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. We propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme.
arXiv Detail & Related papers (2024-12-16T10:14:01Z)
- GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for domain generalization (DG). Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization. Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z)
- SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression [7.6131620435684875]
SLIM is a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation. SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLIM without fine-tuning.
arXiv Detail & Related papers (2024-10-12T18:36:07Z)
- QuAILoRA: Quantization-Aware Initialization for LoRA [46.00375834217641]
QLoRA reduces the memory cost of fine-tuning a large language model (LLM) with LoRA by quantizing the base LLM.
QLoRA introduces quantization errors that negatively impact model performance after fine-tuning.
arXiv Detail & Related papers (2024-10-09T19:06:37Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component (a minimal sketch of this decomposition appears after this list).
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
- Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR using low-bit quantization can achieve performance on par with the full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
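The LQ-LoRA entry above refers to an iterative decomposition of each pretrained matrix into a low-rank component plus a quantized component. A minimal NumPy sketch of one such alternating scheme follows; the per-tensor round-to-nearest quantizer and the fixed iteration count are simplifying assumptions, not the authors' algorithm.

```python
# Illustrative sketch of an alternating low-rank-plus-quantized decomposition,
# W ≈ Q + L1 @ L2, in the spirit of the LQ-LoRA entry above. The uniform
# round-to-nearest quantizer and fixed iteration count are assumptions.
import numpy as np

def quantize_rtn(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Per-tensor uniform round-to-nearest quantization, returned in dequantized form."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def lowrank_plus_quant(W: np.ndarray, rank: int = 16, bits: int = 4, iters: int = 10):
    """Alternate between quantizing the residual and refitting the rank-r factors via truncated SVD."""
    L1 = np.zeros((W.shape[0], rank))
    L2 = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        Q = quantize_rtn(W - L1 @ L2, bits)            # quantize what the low-rank part misses
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L1 = U[:, :rank] * S[:rank]                    # best rank-r fit of the residual
        L2 = Vt[:rank]
    return Q, L1, L2
```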