NeUQI: Near-Optimal Uniform Quantization Parameter Initialization
- URL: http://arxiv.org/abs/2505.17595v2
- Date: Tue, 27 May 2025 04:25:03 GMT
- Title: NeUQI: Near-Optimal Uniform Quantization Parameter Initialization
- Authors: Li Lin, Xinyu Hu, Xiaojun Wan
- Abstract summary: Post-training quantization of large language models (LLMs) offers a promising solution that reduces their memory footprint and decoding latency. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance. We propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization.
- Score: 41.08779476737888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.
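For context, a minimal sketch of the Min-Max initialization baseline that the abstract calls suboptimal (illustrative code only, not NeUQI itself; the bit-width and shapes are arbitrary):

```python
import numpy as np

def minmax_uniform_init(w: np.ndarray, bits: int = 4):
    """Min-Max initialization of uniform quantization parameters.

    Maps [w.min(), w.max()] onto the integer grid {0, ..., 2**bits - 1}.
    This is the baseline the paper improves upon; the error-minimizing
    scale/offset generally does not coincide with the min/max range.
    """
    qmax = 2**bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax          # step size between grid points
    zero_point = np.round(-w_min / scale)   # integer offset aligning zero
    return scale, zero_point

def fake_quantize(w: np.ndarray, scale: float, zero_point: float, bits: int = 4):
    """Quantize-dequantize round trip used to measure quantization error."""
    qmax = 2**bits - 1
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
scale, zp = minmax_uniform_init(w, bits=3)
err = np.mean((w - fake_quantize(w, scale, zp, bits=3))**2)
print(f"Min-Max init MSE at 3 bits: {err:.6f}")
```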
Related papers
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models [2.1388885579612804]
GANQ is a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup-table-based mpGEMM. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization.
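The summary does not spell out GANQ's formulation; below is a generic, hedged sketch of the lookup-table-based non-uniform dequantization primitive that LUT-based mpGEMM kernels rely on (all names, shapes, and the per-row codebook layout are illustrative assumptions):

```python
import numpy as np

# Hypothetical per-row codebooks: each row of the weight matrix stores
# 3-bit indices into its own table of 8 learned centroids (non-uniform grid).
rng = np.random.default_rng(1)
rows, cols, bits = 4, 16, 3
lut = np.sort(rng.normal(size=(rows, 2**bits)), axis=1)  # (rows, 8) centroids
idx = rng.integers(0, 2**bits, size=(rows, cols))        # 3-bit codes per weight

# Dequantization is a pure table lookup -- no per-weight multiply needed,
# which is what LUT-based mpGEMM kernels exploit on GPUs.
w_dequant = np.take_along_axis(lut, idx, axis=1)

x = rng.normal(size=cols)
y = w_dequant @ x   # mixed-precision GEMM: FP activations x LUT weights
print(y)
```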
arXiv Detail & Related papers (2025-01-22T15:29:09Z)
- Q-MAML: Quantum Model-Agnostic Meta-Learning for Variational Quantum Algorithms [4.525216077859531]
We introduce a new framework for optimizing parameterized quantum circuits (PQCs) that employs a classical optimizer inspired by the Model-Agnostic Meta-Learning (MAML) technique. Our framework features a classical neural network, called Learner, whose output serves as the initial parameters of the PQC. In the adaptation phase, the framework requires only a few PQC updates to converge to a more accurate value, while the Learner remains unchanged.
arXiv Detail & Related papers (2025-01-10T12:07:00Z)
- Channel-Wise Mixed-Precision Quantization for Large Language Models [47.00361921910259]
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks. Weight-only quantization presents a promising solution to reduce their memory footprint. We introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method.
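A toy sketch of channel-wise bit allocation under an average-bit budget; the sensitivity proxy used here (per-channel weight range) is an assumption for illustration, not CMPQ's actual activation-aware criterion:

```python
import numpy as np

def assign_channel_bits(w: np.ndarray, budget_bits: int = 4):
    """Toy channel-wise bit allocation (illustrative only).

    Channels whose weights span a wider range get more bits, subject to
    keeping the average bit-width at the budget."""
    spread = w.max(axis=1) - w.min(axis=1)   # per-output-channel range
    order = np.argsort(-spread)              # most sensitive channels first
    bits = np.full(w.shape[0], budget_bits)
    k = w.shape[0] // 4
    bits[order[:k]] += 1                     # promote the top quartile
    bits[order[-k:]] -= 1                    # demote the bottom quartile
    return bits                              # mean stays at budget_bits

rng = np.random.default_rng(2)
w = rng.normal(size=(8, 64)) * rng.uniform(0.1, 2.0, size=(8, 1))
print(assign_channel_bits(w))
```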
arXiv Detail & Related papers (2024-10-16T21:34:41Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been widely adopted to accelerate inference and reduce the memory consumption of large language models. We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
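A hedged control-flow sketch of speculative decoding with two quantized variants of one model, in the spirit of pairing complementary schemes; `draft_next` and `verify` are illustrative stand-ins, not QSPEC's API:

```python
import random
random.seed(0)

def draft_next(ctx):
    """Stand-in for the fast low-bit quantized model."""
    return random.randrange(100)

def verify(ctx, proposals):
    """Stand-in for the higher-precision model: accept a prefix of drafts."""
    accepted = []
    for tok in proposals:
        if random.random() < 0.8:  # dummy agreement test between the models
            accepted.append(tok)
        else:
            break                  # first mismatch ends the speculative run
    return accepted

ctx, gamma = [1, 2, 3], 4          # gamma = number of drafted tokens per round
while len(ctx) < 15:
    proposals = []
    for _ in range(gamma):
        proposals.append(draft_next(ctx + proposals))
    ctx += verify(ctx, proposals) or [draft_next(ctx)]  # always make progress
print(ctx)
```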
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss. We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
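A simplified sketch of the E2E-QP idea (frozen weights, trainable per-channel quantization parameters, straight-through estimator for rounding); the layout and initialization here are assumptions and differ from EfficientQAT's full recipe:

```python
import torch

class FakeQuantLinear(torch.nn.Module):
    """Frozen full-precision weights; only scale/zero-point are trained."""
    def __init__(self, weight: torch.Tensor, bits: int = 2):
        super().__init__()
        self.register_buffer("weight", weight)   # weights stay fixed
        self.qmax = 2**bits - 1
        w_min = weight.min(1, keepdim=True).values
        w_max = weight.max(1, keepdim=True).values
        init = ((w_max - w_min) / self.qmax).clamp_min(1e-8)
        self.scale = torch.nn.Parameter(init)            # trainable step size
        self.zero = torch.nn.Parameter(-w_min / init)    # trainable offset

    def forward(self, x):
        t = self.weight / self.scale + self.zero
        q = t + (torch.round(t) - t).detach()    # STE: identity gradient
        q = torch.clamp(q, 0, self.qmax)
        w_hat = (q - self.zero) * self.scale     # dequantized weights
        return x @ w_hat.t()

lin = FakeQuantLinear(torch.randn(8, 16), bits=2)
loss = lin(torch.randn(4, 16)).pow(2).mean()
loss.backward()
print(lin.scale.grad.shape)  # gradients reach only the quantization parameters
```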
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the memory and compute demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models [88.80146574509195]
Quantization is a promising approach for reducing memory overhead and accelerating inference.
We propose a novel zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs.
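For reference, a generic sharpness-aware minimization (SAM) step, the textbook recipe that sharpness-aware methods build on; this is an assumption-level illustration, not ZSAQ's exact objective, and `loss_fn(model, batch)` is an assumed caller-supplied closure:

```python
import torch

def sam_step(model, loss_fn, batch, rho=0.05):
    """Perturb weights toward the locally worst case, take the gradient
    there, then restore (assumes every parameter receives a gradient)."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    scale = rho / (torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12)
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g, alpha=scale.item())    # ascend to the sharp point
    model.zero_grad()
    loss_fn(model, batch).backward()         # sharpness-aware gradients
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(g, alpha=scale.item())    # restore original weights
    # the caller's optimizer now steps with the gradients left in .grad
```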
arXiv Detail & Related papers (2023-10-20T07:09:56Z)
- QuantEase: Optimization-based Quantization for Language Models [17.333778751252392]
This work introduces QuantEase, a layer-wise post-training quantization (PTQ) framework for Large Language Models (LLMs), where individual layers undergo separate quantization.
Our CD-based approach features straightforward updates, relying solely on vector operations.
We also explore an outlier-aware variant, allowing significant weights (outliers) to be retained with complete precision.
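A simplified coordinate-descent sketch in the spirit of QuantEase's CD-based updates: minimize the layer reconstruction error over grid-constrained weights one coordinate at a time, each update in closed form followed by a snap to the grid (the paper's exact updates and outlier handling are more refined; the grid and shapes below are toy assumptions):

```python
import numpy as np

def cd_quantize_column(X, w, grid, iters=3):
    """Minimize ||X w - X q||^2 over q restricted to `grid` by coordinate
    descent on the quadratic objective (q - w)^T H (q - w), H = X^T X."""
    H = X.T @ X
    q = grid[np.abs(grid[None, :] - w[:, None]).argmin(1)]  # nearest-grid init
    for _ in range(iters):
        r = H @ (q - w)                      # half the objective's gradient
        for i in range(len(w)):
            if H[i, i] <= 0:
                continue
            opt = q[i] - r[i] / H[i, i]      # unconstrained optimum for coord i
            new = grid[np.abs(grid - opt).argmin()]  # project onto the grid
            if new != q[i]:
                r += H[:, i] * (new - q[i])  # keep the residual consistent
                q[i] = new
    return q

rng = np.random.default_rng(3)
X, w = rng.normal(size=(64, 16)), rng.normal(size=16)
grid = np.linspace(-2, 2, 8)                 # toy 3-bit uniform grid
q = cd_quantize_column(X, w, grid)
print(np.linalg.norm(X @ (w - q)))           # layer reconstruction error
```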
arXiv Detail & Related papers (2023-09-05T01:39:09Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- NIPQ: Noise proxy-based Integrated Pseudo-Quantization [9.207644534257543]
The straight-through estimator (STE) incurs unstable convergence during quantization-aware training (QAT).
We propose a novel noise proxy-based integrated pseudo-quantization (NIPQ) that enables unified support of pseudo-quantization for both activations and weights.
NIPQ outperforms existing quantization algorithms in various vision and language applications by a large margin.
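A loose sketch of the noise-proxy idea: during training, replace non-differentiable rounding with additive uniform noise whose magnitude matches the rounding error (illustrative only; NIPQ's integrated formulation covering activations and weights jointly is more involved):

```python
import torch

def noise_proxy(w: torch.Tensor, step: float):
    """Pseudo-quantization via a noise proxy: rounding to a grid of width
    `step` is replaced during training by noise u ~ U(-step/2, step/2),
    which matches the rounding-error distribution but stays differentiable."""
    if torch.is_grad_enabled():                # training: differentiable proxy
        return w + (torch.rand_like(w) - 0.5) * step
    return torch.round(w / step) * step        # inference: real rounding

w = torch.randn(5, requires_grad=True)
noise_proxy(w, step=0.1).sum().backward()
print(w.grad)   # gradients flow exactly, unlike the biased STE for rounding
```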
arXiv Detail & Related papers (2022-06-02T01:17:40Z)
- FLIP: A flexible initializer for arbitrarily-sized parametrized quantum circuits [105.54048699217668]
We propose FLIP, a FLexible Initializer for arbitrarily-sized Parametrized quantum circuits.
FLIP can be applied to any family of PQCs, and instead of relying on a generic set of initial parameters, it is tailored to learn the structure of successful parameters.
We illustrate the advantage of using FLIP in three scenarios: a family of problems with proven barren plateaus, PQC training to solve max-cut problem instances, and PQC training for finding the ground state energies of 1D Fermi-Hubbard models.
arXiv Detail & Related papers (2021-03-15T17:38:33Z)