SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models
- URL: http://arxiv.org/abs/2512.14481v1
- Date: Tue, 16 Dec 2025 15:12:34 GMT
- Title: SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models
- Authors: Shizhuo Mao, Song Chen, Yi Kang
- Abstract summary: We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
- Score: 6.235887167172886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel at natural language tasks but face deployment challenges because their growing size is outpacing advances in GPU memory. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing quantization-aware training (QAT) approaches further suffer from the cost of retraining weights. We propose SASQ: a lightweight QAT framework tailored specifically to activation quantization factors. SASQ optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
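To make the mechanism concrete, below is a minimal sketch of the general recipe the abstract describes: a single static activation scale learned with a straight-through estimator while the pre-trained weights stay frozen, with out-of-range activations clipped (which is what truncates outliers). This is an illustrative sketch under stated assumptions, not the authors' implementation; the module name `StaticActQuant`, the initial scale, and the calibration loop are all hypothetical.

```python
# Hypothetical sketch of SASQ-style training: only a static per-tensor
# activation scale is learned; the pre-trained weights are never updated.
import torch
import torch.nn as nn

class StaticActQuant(nn.Module):
    """Fake-quantizes activations to int8 with one learnable static scale."""
    def __init__(self, init_scale: float, n_bits: int = 8):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))  # the only trainable tensor
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale.clamp(min=1e-6)
        # Clamping to the integer range truncates outliers beyond the learned scale.
        x_q = torch.clamp(x / s, -self.qmax - 1, self.qmax)
        # Straight-through estimator: rounding acts as identity in the backward
        # pass, so gradients still reach the scale parameter.
        x_q = x_q + (torch.round(x_q) - x_q).detach()
        return x_q * s

# Usage sketch: freeze a pre-trained layer and calibrate only the activation scale.
layer = nn.Linear(4096, 4096)
for p in layer.parameters():
    p.requires_grad_(False)                      # weights stay at their pre-trained values
aq = StaticActQuant(init_scale=0.05)
opt = torch.optim.Adam(aq.parameters(), lr=1e-3)

for _ in range(100):                             # stand-in for a small calibration set
    x = torch.randn(8, 4096)
    loss = (layer(aq(x)) - layer(x)).pow(2).mean()  # match the full-precision output
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the scale is a fixed constant after calibration, inference needs no per-batch statistics, which is what makes the scheme static and edge-friendly.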
Related papers
- D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs [33.883527341335856]
Weight-only post-training quantization (PTQ) is appealing as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision. We propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives.
arXiv Detail & Related papers (2026-01-30T05:49:48Z) - StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths [49.94623294999562]
Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints. Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from mismatch, instability, or high computational overhead. We propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low-bit settings.
arXiv Detail & Related papers (2026-01-27T08:00:57Z) - End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs. We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization (a generic zeroth-order update is sketched after this list). Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z) - LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning [50.89500210372827]
Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. LoTA-QAF is a novel fine-tuning method specifically designed for quantized LLMs. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%.
arXiv Detail & Related papers (2025-05-24T14:47:28Z) - Scaling Law for Quantization-Aware Training [41.782744728992675]
Quantization-aware training (QAT) reduces model precision while maintaining performance. Existing QAT scaling laws ignore key factors such as the number of training tokens and quantization granularity. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size.
arXiv Detail & Related papers (2025-05-20T12:54:43Z) - Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression [55.323397702682506]
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full-precision weights into low-bit weights without costly retraining. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery.
arXiv Detail & Related papers (2025-04-10T02:19:03Z) - EfQAT: An Efficient Framework for Quantization-Aware Training [20.47826378511535]
Quantization-aware training (QAT) schemes have been shown to achieve near-full-precision accuracy.
Post-training quantization (PTQ) schemes do not involve training and are therefore computationally cheap.
We propose EfQAT, which generalizes both schemes by optimizing only a subset of the parameters of a quantized model.
arXiv Detail & Related papers (2024-11-17T11:06:36Z) - GWQ: Gradient-Aware Weight Quantization for Large Language Models [56.22507677736051]
Large language models (LLMs) show impressive performance in solving complex language tasks. Compressing LLMs to low bit-widths enables deployment on resource-constrained devices. We propose gradient-aware weight quantization (GWQ), the first gradient-aware approach to low-bit weight quantization.
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid [36.33062038680275]
Large language models (LLMs) have shown immense potential across various domains.
Post-training quantization has emerged as a promising technique to reduce memory requirements and decoding latency.
We propose LeanQuant, a novel quantization method that is accurate, versatile, and scalable.
arXiv Detail & Related papers (2024-07-14T00:23:51Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss. We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models [5.304907804008533]
We propose L4Q, a method that integrates Quantization-Aware Training (QAT) with Low-Rank Adaptation (LoRA). By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead, making its training cost comparable to LoRA. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy.
arXiv Detail & Related papers (2024-02-07T14:35:05Z)
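The ZeroQAT entry above leans on zeroth-order optimization to sidestep backpropagation memory. The sketch below shows what a generic SPSA-style zeroth-order update looks like (two forward passes, no backward pass); it is a hedged illustration of the technique in general, not the ZeroQAT algorithm, and `spsa_step`, the toy loss, and the hyperparameters are assumptions.

```python
# Generic zeroth-order (SPSA-style) update: estimate the gradient from two
# loss evaluations along a random direction, so no backward pass is needed.
import torch

def spsa_step(params, loss_fn, lr=0.05, eps=1e-3):
    """One zeroth-order step over a list of parameter tensors."""
    dirs = [torch.randn_like(p) for p in params]      # one random direction per tensor
    with torch.no_grad():
        for p, d in zip(params, dirs):
            p.add_(eps * d)
        loss_plus = loss_fn()
        for p, d in zip(params, dirs):
            p.sub_(2 * eps * d)
        loss_minus = loss_fn()
        g = (loss_plus - loss_minus) / (2 * eps)       # directional derivative estimate
        for p, d in zip(params, dirs):
            p.add_(eps * d)                            # restore the original parameters
            p.sub_(lr * g * d)                         # step along the sampled direction

# Toy usage: fit a single scale so that scale * x matches 2 * x, without backprop.
scale = torch.tensor(0.5)
x = torch.randn(64)
y = 2.0 * x
for _ in range(200):
    spsa_step([scale], lambda: ((scale * x - y) ** 2).mean())
```

In a QAT setting the same idea can be applied to quantization parameters (and, as in ZeroQAT, to weights), trading extra forward passes for a backpropagation-free memory footprint.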