Related papers: Beacon: Post-Training Quantization with Integrated Grid Selection

Beacon: Post-Training Quantization with Integrated Grid Selection

URL: http://arxiv.org/abs/2508.20293v2
Date: Thu, 04 Sep 2025 05:03:16 GMT
Title: Beacon: Post-Training Quantization with Integrated Grid Selection
Authors: Shihao Zhang, Rayan Saab,
Abstract summary: Key challenge in per-channel post-training quantization is selecting appropriate scaling factors.<n>We propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning.<n>Beacon achieves competitive performance compared to state-of-the-art methods.
Score: 5.886065213861507
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Quantization is a widely used compression technique for reducing the memory and computation costs of large pre-trained models. A key challenge in per-channel post-training quantization (PTQ) is selecting appropriate scaling factors to replace weight values with values from a scaled integer grid. Existing methods typically fix the scale at the outset via heuristic tuning or grid search. We propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning. Beacon performs per-channel PTQ directly using an unscaled grid and automatically determines the optimal scaling factors by exploiting the geometry of scalar quantization. It does not rely on back-propagation or large calibration sets. Despite its simplicity and tuning-free nature, Beacon achieves competitive performance compared to state-of-the-art methods, making it a practical solution for efficient model deployment.

Related papers

Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook.<n>Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z)
GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration [21.474315621757594]
We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures.<n>Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model.<n>GPTAQ is easy to implement, simply using 20 more lines of code than GPTQ but improving its performance under low-bit quantization.
arXiv Detail & Related papers (2025-04-03T15:30:43Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution [95.98801201266099]
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps.<n>We propose a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR, PassionSR.<n>Our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR.
arXiv Detail & Related papers (2024-11-26T04:49:42Z)
Gradient-Based Post-Training Quantization: Challenging the Status Quo [23.1120983784623]
Quantization has become a crucial step for the efficient deployment of deep neural networks. In this work, we show that the process is, to a certain extent, robust to a number of variables. We derive a number of best practices for designing more efficient and scalable GPTQ methods.
arXiv Detail & Related papers (2023-08-15T09:25:11Z)
Compressing Volumetric Radiance Fields to 1 MB [19.380248980850727]
Approximating radiance fields with volumetric grids is one of promising directions for improving NeRF. We introduce a simple yet effective framework, called vector quantized radiance fields (VQRF), for compressing these volume-grid-based radiance fields. In combination with an efficient joint tuning strategy and post-processing, our method can achieve a compression ratio of 100$times$ by reducing the overall model size to 1 MB.
arXiv Detail & Related papers (2022-11-29T17:11:25Z)
Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INN) are able to increase upscaling accuracy significantly by optimizing the downscaling and upscaling cycle jointly. A simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling by training only one model in this work. It is shown to achieve a state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs.
arXiv Detail & Related papers (2022-09-26T22:22:30Z)
OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization [32.60139548889592]
We propose a novel One-shot Pruning-Quantization (OPQ) in this paper. OPQ analytically solves the compression allocation with pre-trained weight parameters only. We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook.
arXiv Detail & Related papers (2022-05-23T09:05:25Z)
Channel Scaling: A Scale-and-Select Approach for Transfer Learning [2.6304695993930594]
Transfer learning with pre-trained neural networks is a common strategy for training classifiers in medical image analysis. We propose a novel approach to efficiently build small and well performing networks by introducing the channel-scaling layers. By imposing L1 regularization and thresholding on the scaling weights, this framework iteratively removes unnecessary feature channels from a pre-trained model.
arXiv Detail & Related papers (2021-03-22T23:26:57Z)
Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.