QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
- URL: http://arxiv.org/abs/2509.17428v3
- Date: Fri, 26 Sep 2025 11:23:43 GMT
- Title: QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
- Authors: Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
- Abstract summary: We propose a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost.
- Score: 14.492535012602625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
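The abstract describes the adapter construction only at a high level. For orientation: FT-based adapters in the FourierFT style train a small set of spectral coefficients and map them to a dense weight update through a transform kernel; with the WHT as that kernel, the transform is orthogonal, involutory, and computable in O(n log n) using only additions and subtractions. The sketch below is a minimal illustration under those assumptions; the class name, the random coefficient placement, and the constructor signature are hypothetical stand-ins, not the paper's API (QWHA instead uses adaptive parameter selection and value refinement for initialization).

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dimension.
    Requires a power-of-two length; uses only adds/subs, O(n log n).
    The orthonormal WHT is involutory: fwht(fwht(x)) == x."""
    n = x.shape[-1]
    assert n & (n - 1) == 0, "last dim must be a power of two"
    y, h = x.clone(), 1
    while h < n:
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return y / n ** 0.5

class WHTAdapter(torch.nn.Module):
    """Hypothetical FT-style adapter: k trainable spectral coefficients are
    scattered into a (d_out, d_in) grid and mapped to a dense weight update
    by a 2D WHT. Random placement stands in for the paper's adaptive
    parameter selection; the zero init stands in for its value refinement."""
    def __init__(self, q_weight: torch.Tensor, k: int = 1024):
        super().__init__()
        d_out, d_in = q_weight.shape
        self.register_buffer("qw", q_weight)             # frozen dequantized weight
        idx = torch.randperm(d_out * d_in)[:k]
        self.register_buffer("rows", idx // d_in)
        self.register_buffer("cols", idx % d_in)
        self.coeff = torch.nn.Parameter(torch.zeros(k))  # the only trainable part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = torch.zeros_like(self.qw)
        s[self.rows, self.cols] = self.coeff
        dw = fwht(fwht(s).transpose(-1, -2)).transpose(-1, -2)  # 2D WHT
        return x @ (self.qw + dw).T
```

Note that this plain FWHT needs both weight dimensions to be powers of two; typical transformer hidden sizes (e.g., 4096) satisfy this.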
Related papers
- Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models [41.677469535447024]
Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. Post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration. Recent advances in post-training quantization have demonstrated that even sub-4-bit methods can maintain most of the original model performance.
arXiv Detail & Related papers (2025-12-25T12:39:36Z)
- Efficiently Training A Flat Neural Network Before It has been Quantizated [7.236012064192816]
Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. It remains unclear how to efficiently train a model-agnostic neural network that is tailored to a predefined low-bit precision. We propose a framework that proactively pre-conditions the model by measuring and disentangling the error sources.
arXiv Detail & Related papers (2025-11-03T11:21:45Z)
- Beyond Outliers: A Study of Optimizers Under Quantization [82.75879062804955]
We study the impact of optimizer choice on model robustness under quantization. We evaluate how model performance degrades when trained with different optimizers. We derive scaling laws for quantization-aware training under different optimizers.
arXiv Detail & Related papers (2025-09-27T21:15:22Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
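The PT$^2$-LLM entry above names an Asymmetric Ternary Quantizer without giving its construction. As a point of reference, a generic asymmetric ternarizer approximates each weight group as alpha * t + beta with t in {-1, 0, 1}. The sketch below applies the classic TWN threshold heuristic to offset-centered weights; every detail (the 0.7 factor, per-tensor grouping, the function name) is an illustrative assumption, not the paper's two-stage pipeline.

```python
import torch

def ternarize_asym(w: torch.Tensor):
    """Approximate w ~= alpha * t + beta with t in {-1, 0, 1}.
    The offset beta absorbs asymmetry; threshold and scale follow the
    common TWN heuristic applied to the zero-centered residual."""
    beta = w.mean()
    r = w - beta
    delta = 0.7 * r.abs().mean()                   # sparsity threshold
    t = torch.sign(r) * (r.abs() > delta).float()  # ternary codes
    nz = t != 0
    alpha = r[nz].abs().mean() if nz.any() else r.new_zeros(())
    return t, alpha, beta

w = torch.randn(4096) + 0.3                        # skewed weight distribution
t, alpha, beta = ternarize_asym(w)
mse = (w - (alpha * t + beta)).pow(2).mean()       # reconstruction error
```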
- How Can Quantum Deep Learning Improve Large Language Models? [24.83900854986565]
Full fine-tuning achieves strong performance but imposes prohibitive computational and memory costs. Low-rank adaptation (LoRA), prefix tuning, and sparse low-rank adaptation (SoRA) address this issue by reducing trainable parameters while maintaining competitive accuracy. Recent advances in quantum deep learning introduce novel opportunities through quantum-inspired encoding and parameterized quantum circuits.
arXiv Detail & Related papers (2025-09-17T08:18:58Z)
- ZeroQAT: Your Quantization-aware Training but Efficient [53.25965863436039]
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs). Existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. We propose ZeroQAT, a zeroth-order optimization-based QAT framework.
arXiv Detail & Related papers (2025-08-21T01:18:27Z)
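The ZeroQAT entry above does not detail its estimator, but the standard ingredient of zeroth-order training (as in SPSA, and MeZO for LLM fine-tuning) is a two-point directional-derivative estimate computed from forward passes alone, which avoids storing activations for backpropagation. A minimal sketch, with all names illustrative and loss_fn assumed to return a scalar tensor:

```python
import torch

@torch.no_grad()
def zo_step(params, loss_fn, lr=1e-4, eps=1e-3):
    """One zeroth-order (SPSA-style) update: probe the loss at theta +/- eps*z
    along a shared random direction z, then step along z by the estimated
    directional derivative. Two forward passes, no backward pass."""
    zs = [torch.randn_like(p) for p in params]
    for p, z in zip(params, zs):      # move to theta + eps * z
        p.add_(z, alpha=eps)
    loss_plus = loss_fn()
    for p, z in zip(params, zs):      # move to theta - eps * z
        p.add_(z, alpha=-2 * eps)
    loss_minus = loss_fn()
    for p, z in zip(params, zs):      # restore theta
        p.add_(z, alpha=eps)
    g = float(loss_plus - loss_minus) / (2 * eps)
    for p, z in zip(params, zs):
        p.add_(z, alpha=-lr * g)      # descend along z
```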
- FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years. Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
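The two ingredients the RoSTE entry above names, a straight-through estimator and a rotation that suppresses outliers, are both standard and easy to illustrate; the adaptive rotation selection is not reproduced here. A minimal sketch with a fixed Hadamard rotation (all parameter choices illustrative):

```python
import torch

def fake_quant_ste(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform fake-quantization: quantized values in the forward pass,
    identity gradient in the backward pass (the straight-through trick)."""
    q = 2 ** (bits - 1) - 1
    s = x.abs().amax() / q
    xq = (x / s).round().clamp(-q - 1, q) * s
    return x + (xq - x).detach()

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Hadamard matrix via Sylvester's construction (n = 2^k)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / n ** 0.5

# Rotating spreads an outlier channel over all coordinates before
# quantization; R is orthogonal, so R.T undoes the rotation exactly.
W = torch.randn(64, 64); W[:, 0] *= 20
R = hadamard(64)
err_plain = (fake_quant_ste(W) - W).norm()
err_rot = (fake_quant_ste(W @ R) @ R.T - W).norm()  # typically smaller
```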
- SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models [2.867517731896504]
SQFT is an end-to-end solution for low-precision sparse parameter-efficient fine-tuning of large pre-trained models.
SQFT allows for effective model manipulation in resource-constrained environments.
SQFT also addresses the challenge of having quantized weights and adapters with different numerical precisions.
arXiv Detail & Related papers (2024-10-01T19:49:35Z)
- Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance [20.659750151408186]
Large Language Models (LLMs) have demonstrated impressive performance across various domains. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA). We propose Quantization-Aware fine-tuning with Balanced Low-Rank Adaptation (QA-BLoRA).
arXiv Detail & Related papers (2024-07-24T06:16:37Z)
- Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on a quantized 8-bit ViT, outperforms gradient-based TENT on a full-precision 32-bit ViT, and achieves up to a 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models [5.304907804008533]
We propose L4Q, a method that integrates Quantization-Aware Training (QAT) with Low-Rank Adaptation (LoRA). By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead, making its training cost comparable to LoRA. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy.
arXiv Detail & Related papers (2024-02-07T14:35:05Z)
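The L4Q entry above does not specify its memory-optimized layer, but the generic way to integrate QAT with LoRA is to add the low-rank update to the frozen base weight before fake-quantization, so that the merged weight is natively representable in the target format after training. A hedged sketch (class name and signature are illustrative, not L4Q's API):

```python
import torch
import torch.nn as nn

def fq(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform fake-quantization with a straight-through estimator."""
    q = 2 ** (bits - 1) - 1
    s = x.abs().amax() / q
    return x + ((x / s).round().clamp(-q - 1, q) * s - x).detach()

class QATLoRALinear(nn.Module):
    """Hypothetical QAT+LoRA layer: quantize (W + B @ A) jointly, so the
    adapters learn to compensate quantization error and the final merged
    weight needs no extra full-precision branch at inference time."""
    def __init__(self, weight: torch.Tensor, rank: int = 8, bits: int = 4):
        super().__init__()
        self.register_buffer("W", weight)                 # frozen base weight
        d_out, d_in = weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))   # update starts at zero
        self.bits = bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ fq(self.W + self.B @ self.A, self.bits).T
```

Contrast this with plain QLoRA, where the adapter stays outside the quantizer and merging it back would reintroduce a full-precision weight.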
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to eliminate floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.