LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs
- URL: http://arxiv.org/abs/2602.17681v1
- Date: Wed, 04 Feb 2026 15:32:27 GMT
- Title: LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs
- Authors: Ofir Gordon, Lior Dikstein, Arnon Netzer, Idan Achituve, Hai Victor Habi,
- Abstract summary: Applying invertible transformations to activations can significantly improve quantization. Modern hardware increasingly supports the microscaling (MX) data format. We propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations.
- Score: 11.773543873657752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness by reducing activation outliers; however, existing approaches are largely restricted to rotation or Hadamard-based transformations. Moreover, most studies have focused primarily on traditional quantization schemes, whereas modern hardware increasingly supports the microscaling (MX) data format. Attempts to combine the two have shown severe performance degradation, leading prior work to introduce assumptions on the transformations. In this work, we take a complementary perspective. First, we provide a theoretical analysis of transformations under MX quantization by deriving a bound on the quantization error. Our analysis emphasizes the importance of accounting for both the activation distribution and the underlying quantization structure. Building on this analysis, we propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations optimized using standard deep learning tools. Experiments show consistent improvements in average accuracy for MX low-bit quantization over strong baselines on a wide range of zero-shot benchmarks, across multiple model sizes.
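The abstract gives no implementation details, so the following is only a rough sketch of the mechanism it describes, assuming PyTorch and toy shapes: an invertible affine transform is applied to the activations, its inverse is folded into the adjacent weight matrix so the full-precision output is unchanged, and the transformed activations are fake-quantized with an MX-style blockwise power-of-two scale. The transform below is a fixed toy matrix; in LATMiX it would be learned with gradient-based tools, and real MX element formats (e.g., MXFP4) differ from the integer-style rounding used here.

```python
# Illustrative sketch only; mx_quantize, the block size, and the toy transform T are
# assumptions made for this example, not LATMiX's actual implementation.
import torch

def mx_quantize(x: torch.Tensor, block_size: int = 32, n_bits: int = 4) -> torch.Tensor:
    """Fake-quantize the last dimension in blocks that share one power-of-two scale,
    mimicking an MX-style microscaling format with low-bit elements."""
    orig_shape = x.shape
    x = x.reshape(-1, block_size)
    qmax = 2 ** (n_bits - 1) - 1                              # symmetric signed range
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / qmax)))   # shared power-of-two block scale
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x_q.reshape(orig_shape)

d, d_out = 64, 128
W = torch.randn(d, d_out)
x = torch.randn(8, d)
x[:, 0] *= 50.0                           # inject an outlier channel, as seen in LLM activations

# Invertible affine transform on the activations; its inverse is folded into the weights,
# so in full precision x @ T @ (T_inv @ W) equals x @ W exactly. In LATMiX, T would be
# learned to reduce the MX quantization error; here it is a random near-identity matrix
# that only demonstrates the mechanics.
T = torch.eye(d) + 0.01 * torch.randn(d, d)
T_inv = torch.linalg.inv(T)

y_ref = x @ W                                       # full-precision reference
y_plain = mx_quantize(x) @ W                        # quantize activations directly
y_transformed = mx_quantize(x @ T) @ (T_inv @ W)    # transform, quantize, fold inverse into W

print("quantization error without transform:", (y_plain - y_ref).norm().item())
print("quantization error with (untrained) transform:", (y_transformed - y_ref).norm().item())
```

Because the transform and its inverse sit on opposite sides of the matrix multiplication, the inverse can be absorbed into the weights offline, so only the forward transform adds runtime cost.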
Related papers
- WUSH: Near-Optimal Adaptive Transforms for LLM Quantization [52.77441224845925]
Quantization to low bitwidth is a standard approach for deploying large language models. A few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. We derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization.
arXiv Detail & Related papers (2025-11-30T16:17:34Z)
- STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization [21.93314755695813]
Quantization is the key method for reducing the inference latency, power, and memory footprint of generative AI models. We propose Sequence Transformation and Mixed Precision (STaMP) quantization.
arXiv Detail & Related papers (2025-10-30T17:53:42Z)
- Mixed-Precision Quantization for Language Models: Techniques and Prospects [10.345914140081925]
Quantization has emerged as an essential compression technique to reduce model size, alleviate memory bottlenecks, and accelerate inference. Mixed-precision quantization offers a promising alternative by selectively allocating precision across layers or within tensors to balance efficiency and accuracy.
arXiv Detail & Related papers (2025-10-19T12:16:40Z)
- A Comprehensive Evaluation on Quantization Techniques for Large Language Models [46.75040730001041]
Post-training quantization (PTQ) can significantly reduce the memory footprint and computational overhead of large language models (LLMs). We conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. We also evaluate the latest MXFP4 and NVFP4 data formats and their performance.
arXiv Detail & Related papers (2025-07-23T11:21:21Z)
- FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years. Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z)
- HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations [17.975720202894905]
Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. We propose HadaNorm, a novel linear transformation that extends existing approaches by both normalizing channel activations and applying Hadamard transforms (an illustrative sketch of such a Hadamard rotation appears after this list). We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-06-11T16:54:34Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- QT-DoG: Quantization-aware Training for Domain Generalization [58.439816306817306]
We propose Quantization-aware Training for Domain Generalization (QT-DoG). We demonstrate that weight quantization effectively leads to flatter minima in the loss landscape. QT-DoG exploits quantization as an implicit regularizer by inducing noise in model weights.
arXiv Detail & Related papers (2024-10-08T13:21:48Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions for better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models.
Existing PTQ methods typically exhibit non-trivial performance loss.
We propose RepQuant, a novel PTQ framework with a quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z)
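Several entries above (notably HadaNorm and RoSTE), as well as the LATMiX abstract, refer to rotation or Hadamard-based transforms for reducing activation outliers. The sketch below is a hedged illustration of that general idea and is not taken from any of these papers; it assumes a Sylvester-constructed Hadamard matrix and a toy outlier channel. Rotating the activations spreads the outlier's energy across all channels, and folding the transpose into the weights leaves the layer output unchanged.

```python
# Illustration only: hadamard(), the shapes, and the outlier injection are assumptions
# made for this sketch, not code from HadaNorm, RoSTE, or LATMiX.
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H

d = 64
H = hadamard(d) / d ** 0.5          # orthonormal, so H @ H.T = I
x = torch.randn(1024, d)
x[:, 3] *= 30.0                     # a heavy outlier channel, typical of LLM activations

x_rot = x @ H                       # rotation spreads the outlier energy across channels
W = torch.randn(d, 16)
W_rot = H.T @ W                     # fold the inverse rotation into the weights

print("max |activation| before:", x.abs().max().item())
print("max |activation| after :", x_rot.abs().max().item())
print("output preserved:", torch.allclose(x_rot @ W_rot, x @ W, atol=1e-3))
```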