Related papers: LoaQ: Layer-wise Output Approximation Quantization

URL: http://arxiv.org/abs/2509.06297v1
Date: Mon, 08 Sep 2025 02:50:11 GMT
Title: LoaQ: Layer-wise Output Approximation Quantization
Authors: Li Lin, Xiaojun Wan
Abstract summary: A natural idea in model quantization is to approximate each component's quantized output to match its original. Layer-wise post-training quantization (PTQ) can achieve, at best, only activation-aware approximations of weights. We propose $LoaQ$, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency.
Score: 40.425536766574844
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A natural and intuitive idea in model quantization is to approximate each component's quantized output to match its original. Layer-wise post-training quantization (PTQ), though based on this idea, adopts a strictly local view and can achieve, at best, only activation-aware approximations of weights. As a result, it often leads to insufficient approximations and practical deviations from this guiding intuition. Recent work has achieved a more accurate approximation of linear-layer outputs within the framework of layer-wise PTQ, but such refinements remain inadequate for achieving alignment with the full model output. Based on a deeper understanding of the structural characteristics of mainstream LLMs, we propose $LoaQ$, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency. It better aligns with this intuition and can feature a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation joint quantization. By integrating seamlessly with existing quantization strategies, it further enhances overall quantization quality and shows strong potential to advance the frontier of post-training quantization.
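The distinction the abstract draws between approximating weights and approximating outputs can be illustrated with a small sketch. This is not the LoaQ method; it only assumes a generic symmetric round-to-nearest quantizer on random data, and the names `quantize_rtn`, `matmul`, and `frobenius` are hypothetical helpers introduced for the illustration. It measures the error of one linear layer both in weight space and in output space for a given set of calibration activations.

```python
import math
import random


def quantize_rtn(column, bits=4):
    """Symmetric round-to-nearest quantization of one weight column (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in column)
    scale = peak / qmax if peak > 0 else 1.0
    # Round to the integer grid, clamp to the signed range, map back to floats.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in column]


def matmul(a, b):
    """Plain-Python product of a (n x k) and b (k x m)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def frobenius(a, b):
    """Frobenius norm of the difference between two equally shaped matrices."""
    return math.sqrt(sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)))


random.seed(0)
tokens, d_in, d_out = 32, 16, 8
# Assumed toy data: calibration activations x and layer weights w.
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(tokens)]
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]

# Quantize each output channel (a column of w) independently, then restore row-major layout.
cols_q = [quantize_rtn(list(col)) for col in zip(*w)]
w_q = [list(row) for row in zip(*cols_q)]

weight_err = frobenius(w, w_q)                         # error in weight space
output_err = frobenius(matmul(x, w), matmul(x, w_q))   # error in this layer's output

print(f"weight error: {weight_err:.4f}")
print(f"output error: {output_err:.4f}")
```

The two numbers measure different things: the first ignores the activations entirely, while the second depends on which inputs the layer actually sees. Both are still local to a single linear layer, which is the limitation the abstract attributes to layer-wise PTQ.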

Related papers

LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Real-World Video Super-Resolution [52.627063566555194]
We introduce LSGQuant, a layer-sensitivity guided quantization approach for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. Our method performs nearly on par with the original full-precision model and significantly exceeds existing quantization techniques.
arXiv Detail & Related papers (2026-02-03T06:53:19Z)
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models [41.677469535447024]
Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. Post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration. Recent advances for post-training quantization have demonstrated that even sub-4-bit methods can maintain most of the original model performance.
arXiv Detail & Related papers (2025-12-25T12:39:36Z)
QiNN-QJ: A Quantum-inspired Neural Network with Quantum Jump for Multimodal Sentiment Analysis [11.46663985298648]
We propose a Quantum-inspired Neural Network with Quantum Jump (QiNN-QJ) for multimodal entanglement modelling. Using Hamiltonian and Lindblad operators jointly, QiNN-QJ generates controllable cross-modal entanglement. This work establishes a principled framework for entangled multimodal fusion and paves the way for quantum-inspired approaches in modelling complex cross-modal correlations.
arXiv Detail & Related papers (2025-10-31T01:25:55Z)
A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA [65.38186593873313]
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. We introduce a proof-of-concept multi-call framework for MHQA, InfoQA. We construct a stringent and noise-rich benchmark to validate our theory and framework.
arXiv Detail & Related papers (2025-09-25T14:11:57Z)
ZeroQAT: Your Quantization-aware Training but Efficient [53.25965863436039]
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs). Existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. We propose ZeroQAT, a zeroth-order optimization-based QAT framework.
arXiv Detail & Related papers (2025-08-21T01:18:27Z)
Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization [0.0]
Layer-wise PTQ is a promising technique for compressing large language models (LLMs). Recent progress in this area is saturating, underscoring the need to revisit its core limitations and explore further improvements. We propose Quantization Error Propagation (QEP), a general, lightweight, and scalable framework that enhances layer-wise PTQ by explicitly propagating quantization errors and compensating for accumulated errors.
arXiv Detail & Related papers (2025-04-13T15:56:00Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
arXiv Detail & Related papers (2023-09-29T14:35:27Z)
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations. Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
QFT: Post-training quantization via fast joint finetuning of all degrees of freedom [1.1744028458220428]
We rethink quantized network parameterization in HW-aware fashion, towards a unified analysis of all quantization DoF. Our single-step simple and extendable method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on-par with SoTA.
arXiv Detail & Related papers (2022-12-05T22:38:58Z)
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation [48.838691414561694]
Nonuniform-to-Uniform Quantization (N2UQ) is a method that can maintain the strong representation ability of nonuniform methods while being hardware-friendly and efficient. N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.7~1.8% on ImageNet.
arXiv Detail & Related papers (2021-11-29T18:59:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.