Related papers: Activation Sensitivity as a Unifying Principle for Post-Training Quantization

Activation Sensitivity as a Unifying Principle for Post-Training Quantization

URL: http://arxiv.org/abs/2601.11663v1
Date: Thu, 15 Jan 2026 22:27:19 GMT
Title: Activation Sensitivity as a Unifying Principle for Post-Training Quantization
Authors: Bruce Changlong Xu,
Abstract summary: Post-training quantization (PTQ) methods for large language models rely on prioritizing which weight channels most strongly influence model behavior.<n>We present a unified theoretical framework for PTQ by formalizing activation sensitivity, defined as the expected impact of channel-wise perturbations on the loss.<n>We analyze the design space of sensitivity metrics, connect gradient-based saliency, Fisher information, and Hessian-based criteria, and clarify their relationships to classical pruning methods such as Optimal Brain Damage and Optimal Brain Surgeon.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training quantization (PTQ) methods for large language models rely on heuristics that implicitly estimate which weight channels most strongly influence model behavior. Two dominant paradigms have emerged: activation-aware methods such as AWQ prioritize channels with large activation magnitudes, while second-order methods such as GPTQ allocate quantization error according to input covariance structure. Despite strong empirical performance, these approaches remain conceptually fragmented, and it is unclear what underlying quantity they are approximating. In this work, we present a unified theoretical framework for PTQ by formalizing activation sensitivity, defined as the expected impact of channel-wise perturbations on the loss. Using a first-order Taylor expansion, we show that sensitivity naturally arises as the squared norm of gradient-weighted activations, yielding a principled measure of channel importance that captures both activation magnitude and downstream error propagation. Within this framework, AWQ and GPTQ can be interpreted as complementary approximations that recover sensitivity under distinct simplifying assumptions. We analyze the design space of sensitivity metrics, connect gradient-based saliency, Fisher information, and Hessian-based criteria, and clarify their relationships to classical pruning methods such as Optimal Brain Damage and Optimal Brain Surgeon. Rather than proposing a new quantization algorithm, this work provides a conceptual foundation for understanding and comparing post-training quantization methods through the lens of sensitivity.

Related papers

Optimal Effective Hamiltonian for Quantum Computing and Simulation [1.0359978670015826]
We establish the Least Action Unitary Transformation as the fundamental principle for effective models.<n>We validate this framework against experimental data from superconducting quantum processors.<n>This work provides a systematic, experimentally validated route for high-precision system learning and Hamiltonian engineering.
arXiv Detail & Related papers (2026-02-03T15:09:29Z)
Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components.<n>The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points.<n> Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years.<n>Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z)
Precision Neural Network Quantization via Learnable Adaptive Modules [27.323901068182234]
Quantization Aware Training (QAT) is a neural network quantization technique that compresses model size and improves operational efficiency.<n>We propose an effective learnable adaptive neural network quantization method, called Adaptive Step Size Quantization (ASQ)
arXiv Detail & Related papers (2025-04-24T05:46:25Z)
Unifying Perplexing Behaviors in Modified BP Attributions through Alignment Perspective [61.5509267439999]
We present a unified theoretical framework for methods like GBP, RectGrad, LRP, and DTD.<n>We demonstrate that they achieve input alignment by combining the weights of activated neurons.<n>This alignment improves the visualization quality and reduces sensitivity to weight randomization.
arXiv Detail & Related papers (2025-03-14T07:58:26Z)
Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries [10.209740962369453]
Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging.<n>A promising alternative is shallow weight factorization, where weights are pruning into two factors, allowing for optimization of $L$penalized neural networks.<n>In this work, we introduce deep weight factorization, adding differenti factors to more than two previous approaches.
arXiv Detail & Related papers (2025-02-04T17:12:56Z)
A Theoretical Perspective for Speculative Decoding Algorithm [60.79447486066416]
One effective way to accelerate inference is emphSpeculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate. This paper tackles this gap by conceptualizing the decoding problem via markov chain abstraction and studying the key properties, emphoutput quality and inference acceleration, from a theoretical perspective.
arXiv Detail & Related papers (2024-10-30T01:53:04Z)
Quantization Aware Factorization for Deep Neural Network Compression [20.04951101799232]
decomposition of convolutional and fully-connected layers is an effective way to reduce parameters and FLOP in neural networks. A conventional post-training quantization approach applied to networks with weights yields a drop in accuracy. This motivated us to develop an algorithm that finds decomposed approximation directly with quantized factors.
arXiv Detail & Related papers (2023-08-08T21:38:02Z)
Where Should We Begin? A Low-Level Exploration of Weight Initialization Impact on Quantized Behaviour of Deep Neural Networks [93.4221402881609]
We present an in-depth, fine-grained ablation study of the effect of different weights initialization on the final distributions of weights and activations of different CNN architectures. To our best knowledge, we are the first to perform such a low-level, in-depth quantitative analysis of weights initialization and its effect on quantized behaviour.
arXiv Detail & Related papers (2020-11-30T06:54:28Z)
Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix. Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.