Related papers: QFT: Post-training quantization via fast joint finetuning of all degrees of freedom

Related papers

LoaQ: Layer-wise Output Approximation Quantization [40.425536766574844]
A natural idea in model quantization is to approximate each component's quantized output to match its original.<n>Layer-wise post-training quantization (PTQ) can achieve, at best, only activation-aware approximations of weights.<n>We propose $LoaQ$, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency.
arXiv Detail & Related papers (2025-09-08T02:50:11Z)
Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos [11.469337174377046]
Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks.<n>OPTQ framework-also known as GPTQ-has emerged as a leading method due to its computational efficiency and strong empirical performance.<n>Despite its widespread adoption, OPTQ lacks rigorous quantitative theoretical guarantees.
arXiv Detail & Related papers (2025-08-06T20:00:40Z)
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years.<n>Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z)
Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning [14.145862114439831]
Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed.<n>Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance.<n>We propose a mixup-sign floating-point quantization framework, first introducing unsigned FP quantization in model quantization, along with timestep-aware LoRA and denoising-factor loss alignment.
arXiv Detail & Related papers (2025-05-27T13:40:47Z)
PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models. We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding. QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models [5.304907804008533]
We propose L4Q, a method that integrates Quantization-Aware Training (QAT) with Low-Rank Adaptation (LoRA) to effectively reduce quantization error. By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead while producing fully-quantized weights.
arXiv Detail & Related papers (2024-02-07T14:35:05Z)
MRQ:Support Multiple Quantization Schemes through Model Re-Quantization [0.17499351967216337]
Deep learning models cannot be easily quantized for diverse fixed-point hardwares. New type of model quantization approach called model re-quantization is proposed. Models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.
arXiv Detail & Related papers (2023-08-01T08:15:30Z)
Quantized Feature Distillation for Network Quantization [32.26577845735846]
Methods adopting the quantization aware training (QAT) paradigm have recently seen a rapid growth, but are often conceptually complicated. This paper proposes a novel and highly effective QAT method, quantized feature distillation (QFD) QFD first trains a quantized (or binarized) representation as the teacher, then quantize the network using knowledge distillation (KD)
arXiv Detail & Related papers (2023-07-20T07:08:24Z)
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel quantize before fine-tuning'' framework, PreQuant. PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error. We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
Distribution-Flexible Subset Quantization for Post-Quantizing Super-Resolution Networks [68.83451203841624]
This paper introduces Distribution-Flexible Subset Quantization (DFSQ), a post-training quantization method for super-resolution networks. DFSQ conducts channel-wise normalization of the activations and applies distribution-flexible subset quantization (SQ) It achieves comparable performance to full-precision counterparts on 6- and 8-bit quantization, and incurs only a 0.1 dB PSNR drop on 4-bit quantization.
arXiv Detail & Related papers (2023-05-10T04:19:11Z)
A self-consistent field approach for the variational quantum eigensolver: orbital optimization goes adaptive [52.77024349608834]
We present a self consistent field approach (SCF) within the Adaptive Derivative-Assembled Problem-Assembled Ansatz Variational Eigensolver (ADAPTVQE) This framework is used for efficient quantum simulations of chemical systems on nearterm quantum computers.
arXiv Detail & Related papers (2022-12-21T23:15:17Z)
NIPQ: Noise proxy-based Integrated Pseudo-Quantization [9.207644534257543]
Straight-through estimator (STE) incurs unstable convergence during quantization-aware training (QAT) We propose a novel noise proxy-based integrated pseudoquantization (NIPQ) that enables unified support of pseudoquantization for both activation and weight. NIPQ outperforms existing quantization algorithms in various vision and language applications by a large margin.
arXiv Detail & Related papers (2022-06-02T01:17:40Z)
Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks. DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons. We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
A Statistical Framework for Low-bitwidth Training of Deep Neural Networks [70.77754244060384]
Fully quantized training (FQT) uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model. One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties.
arXiv Detail & Related papers (2020-10-27T13:57:33Z)
Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.