LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning
- URL: http://arxiv.org/abs/2505.18724v1
- Date: Sat, 24 May 2025 14:47:28 GMT
- Title: LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning
- Authors: Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu
- Abstract summary: Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from the mismatch in data types between the low-precision quantized weights and the high-precision adaptation weights. LoTA-QAF is a novel fine-tuning method specifically designed for quantized LLMs.
- Score: 27.07694377337617
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods.
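Based only on the abstract above, the core mechanism can be pictured with a short sketch: ternary adaptation weights in {-1, 0, +1} shift the integer codes of the quantized weights by at most one grid step, so merging them back is exact, and a sign-based update rule keeps the adaptation ternary during fine-tuning. Everything below (the symmetric 4-bit grid, the clamping, the threshold in the update rule) is an illustrative assumption, not the authors' released implementation.

```python
import torch

# Minimal sketch of the LoTA-QAF idea, assuming symmetric per-channel
# 4-bit quantization: w_int holds integer codes, `scale` maps them back
# to real values. A ternary adaptation in {-1, 0, +1} shifts each
# integer code by at most one grid step, so merging it into w_int is
# exact (lossless): no rounding of high-precision deltas is needed.

def dequantize(w_int, scale):
    return w_int.float() * scale

def merge_ternary(w_int, ternary, n_bits=4):
    # Lossless merge: integer code plus ternary step stays on the integer grid.
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return torch.clamp(w_int + ternary, qmin, qmax).to(w_int.dtype)

def t_signsgd_step(ternary, grad, threshold=0.0):
    # Sketch of a ternary signed-gradient update: only the sign of the
    # gradient is used, and entries with small gradients are left at zero.
    step = -torch.sign(grad) * (grad.abs() > threshold)
    return torch.clamp(ternary + step, -1, 1).to(torch.int8)

# Toy usage
w_int = torch.randint(-8, 8, (4, 4), dtype=torch.int8)   # quantized backbone codes
scale = torch.tensor(0.05)
ternary = torch.zeros_like(w_int)
grad = torch.randn(4, 4)                                  # gradient w.r.t. the adaptation
ternary = t_signsgd_step(ternary, grad)
w_merged = merge_ternary(w_int, ternary)                  # still a valid 4-bit tensor
print(dequantize(w_merged, scale))
```

Because the merge happens entirely on the integer grid, the deployed model keeps the same 4-bit format as the original quantized checkpoint.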
Related papers
- Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression [55.323397702682506]
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery.
arXiv Detail & Related papers (2025-04-10T02:19:03Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
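The summary points to two ingredients: an orthogonal rotation applied consistently to weights and activations so that outliers are spread out before quantization, and fine-tuning through a straight-through estimator so gradients survive the rounding step. The snippet below sketches both under simplifying assumptions (a random orthogonal rotation and per-tensor symmetric quantization); it is not RoSTE's actual code.

```python
import torch

def random_rotation(dim, seed=0):
    # Placeholder for the optimized rotation RoSTE searches for: any
    # orthogonal matrix preserves x @ W.T when applied to both sides.
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=g))
    return q

def fake_quant_ste(x, n_bits=4):
    # Round in the forward pass; pass the gradient straight through.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax + 1e-8
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (x_q - x).detach()

dim = 8
R = random_rotation(dim)
W = torch.randn(16, dim, requires_grad=True)      # trainable weight
x = torch.randn(2, dim)                           # activations with potential outliers
# Rotate activations and weights consistently: (x R) (W R)^T == x W^T.
y = fake_quant_ste(x @ R) @ fake_quant_ste(W @ R).T
y.sum().backward()                                # gradients reach W despite the rounding
```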
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- Taming Sensitive Weights: Noise Perturbation Fine-tuning for Robust LLM Quantization [5.718172547021947]
We propose Noise Perturbation Fine-tuning (NPFT) to tame the sensitive weights' impact on the quantization error. NPFT identifies outlier weights and adds random weight perturbations to the outliers as the model goes through PEFT optimization. When applied to OPT and LLaMA models, our NPFT method achieves stable performance improvements for both uniform and non-uniform quantizers.
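As described, NPFT injects random noise into the outlier weights while the model is fine-tuned with PEFT, so the adapters learn to tolerate the error those weights will incur under quantization. Below is a loose sketch of that idea; the outlier criterion (top fraction by magnitude) and the noise scale are placeholder assumptions.

```python
import torch

def perturb_outliers(weight, outlier_frac=0.01, noise_scale=0.05):
    """Loose sketch: add random noise to the largest-magnitude ('outlier') weights.

    Applied to the frozen base weights at each step of a PEFT run, this exposes
    the adapters to noise on exactly the weights most sensitive to quantization.
    The top-|w| criterion and noise magnitude here are illustrative choices.
    """
    flat = weight.abs().flatten()
    k = max(1, int(outlier_frac * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    mask = weight.abs() >= threshold
    noise = torch.randn_like(weight) * noise_scale * weight.abs()
    return weight + noise * mask

W = torch.randn(256, 256)
W_noisy = perturb_outliers(W)   # use W_noisy in the forward pass during PEFT
```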
arXiv Detail & Related papers (2024-12-08T21:46:22Z)
- IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
IntLoRA adapts quantized diffusion models with integer-type low-rank parameters to include inference efficiency during tuning. During inference, IntLoRA weights can be seamlessly merged into pre-trained weights to directly obtain quantized downstream weights without PTQ.
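The practical appeal is that the low-rank update is itself integer-typed, so merging it into the quantized backbone requires no extra post-training quantization pass. The sketch below illustrates that property under a shared-scale assumption; IntLoRA's actual integral adaptation scheme is more involved.

```python
import torch

# Rough sketch of the "merge without PTQ" property, assuming the backbone
# codes and the low-rank update share one quantization grid (a deliberate
# simplification): integer plus integer stays integer, so the merged
# weight is directly a valid quantized tensor.

n_bits = 4
qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

W_int = torch.randint(qmin, qmax + 1, (64, 64))   # quantized backbone codes (int64)
A_int = torch.randint(-1, 2, (64, 8))             # integer low-rank factors
B_int = torch.randint(-1, 2, (8, 64))

delta_int = A_int @ B_int                          # integer low-rank update
W_merged = torch.clamp(W_int + delta_int, qmin, qmax)

assert not W_merged.is_floating_point()            # still integer codes: no re-quantization step
```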
arXiv Detail & Related papers (2024-10-29T05:50:17Z)
- FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
- Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities. Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z)
- OAC: Output-adaptive Calibration for Accurate Post-training Quantization [28.67781845829386]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs). Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z)
- Scheduling Weight Transitions for Quantization-Aware Training [26.792400685888175]
Quantization-aware training (QAT) simulates a quantization process during training to lower bit-precision of weights/activations. We introduce a transition rate (TR) scheduling technique that controls the number of transitions of quantized weights explicitly.
arXiv Detail & Related papers (2024-04-30T04:12:36Z)
- L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models [5.304907804008533]
We propose L4Q, a method that integrates Quantization-Aware Training (QAT) with Low-Rank Adaptation (LoRA). By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead, making its training cost comparable to LoRA. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy.
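The pattern described here, LoRA adapters trained while the effective weight is fake-quantized in the forward pass so that quantization error is visible during fine-tuning, can be sketched generically as below. This is a plain illustration of QAT combined with LoRA under per-tensor symmetric quantization, not L4Q's memory-optimized layer design.

```python
import torch
import torch.nn as nn

class QATLoRALinear(nn.Module):
    """Generic sketch: LoRA adapters trained through a fake-quantized weight."""
    def __init__(self, in_features, out_features, rank=8, n_bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)        # frozen base weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.n_bits = n_bits

    def fake_quant(self, w):
        # Per-tensor symmetric fake quantization with a straight-through estimator.
        qmax = 2 ** (self.n_bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-8
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        return w + (w_q - w).detach()

    def forward(self, x):
        w_eff = self.weight + self.lora_b @ self.lora_a   # base + low-rank update
        return x @ self.fake_quant(w_eff).T               # quantization seen during tuning

layer = QATLoRALinear(16, 32)
loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()                                            # gradients flow to the adapters only
```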
arXiv Detail & Related papers (2024-02-07T14:35:05Z)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)