SiLQ: Simple Large Language Model Quantization-Aware Training
- URL: http://arxiv.org/abs/2507.16933v1
- Date: Tue, 22 Jul 2025 18:17:53 GMT
- Title: SiLQ: Simple Large Language Model Quantization-Aware Training
- Authors: Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha
- Abstract summary: Large language models can be quantized to reduce inference time latency, model size, and energy consumption. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time. Here, we demonstrate a simple, end-to-end quantization-aware training approach that outperforms the leading published quantization methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.
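The abstract does not spell out the mechanics here, but the core recipe behind quantization-aware training is simple: fake-quantize weights and activations in the forward pass and let gradients flow through the rounding step via a straight-through estimator. The sketch below is a minimal illustration of that general recipe, assuming PyTorch and symmetric per-tensor int8-style quantization; the names (`FakeQuantize`, `QATLinear`) are illustrative and are not taken from the paper.

```python
import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    """Symmetric per-tensor fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        # Quantize to integers, then dequantize back to float ("fake" quantization).
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat round() as identity for the gradient.
        return grad_output, None

class QATLinear(nn.Linear):
    """Linear layer that fake-quantizes both its weights and its input activations."""

    def __init__(self, in_features, out_features, num_bits=8, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.num_bits = num_bits

    def forward(self, x):
        w_q = FakeQuantize.apply(self.weight, self.num_bits)
        x_q = FakeQuantize.apply(x, self.num_bits)
        return nn.functional.linear(x_q, w_q, self.bias)

# Toy usage: fine-tune a tiny model with quantization in the training loop.
model = nn.Sequential(QATLinear(16, 32), nn.ReLU(), QATLinear(32, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
for _ in range(10):
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A full QAT setup for an LLM would typically use per-channel or learned scales and also quantize the KV cache, but, as the abstract notes, no operations beyond the quantizer itself need to be inserted into the model.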
Related papers
- Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining
We propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. Our experiments on LLaMA 7B and 13B benchmarks demonstrate that our method reduces ApiQ's accuracy degradation by 10.85% and 7.54%, respectively.
arXiv Detail & Related papers (2025-04-14T19:31:21Z)
- Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
We conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths. We identify model size, model origin, and task difficulty as critical determinants of performance.
arXiv Detail & Related papers (2025-04-07T08:22:45Z)
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
- LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid
Large language models (LLMs) have shown immense potential across various domains.
Post-training quantization has emerged as a promising technique to reduce memory requirements and decoding latency.
We propose LeanQuant, a novel quantization method that is accurate, versatile, and scalable.
arXiv Detail & Related papers (2024-07-14T00:23:51Z)
- Optimization of DNN-based speaker verification model through efficient quantization technique
Quantization of deep models offers a means to reduce both computational and memory expenses.
Our research proposes an optimization framework for the quantization of the speaker verification model.
arXiv Detail & Related papers (2024-07-12T05:03:10Z)
- Observational Scaling Laws and the Predictability of Language Model Performance
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization.
We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model.
arXiv Detail & Related papers (2023-11-02T15:18:22Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Genie: Show Me the Data for Quantization
We introduce a post-training quantization scheme for zero-shot quantization that produces high-quality quantized networks within a few hours.
We also propose a post-training quantization algorithm to enhance the performance of quantized models.
arXiv Detail & Related papers (2022-12-09T11:18:40Z)
- Zero-shot Adversarial Quantization
We propose a zero-shot adversarial quantization (ZAQ) framework, facilitating effective discrepancy estimation and knowledge transfer.
This is achieved by a novel two-level discrepancy modeling to drive a generator to synthesize informative and diverse data examples.
We conduct extensive experiments on three fundamental vision tasks, demonstrating the superiority of ZAQ over the strong zero-shot baselines.
arXiv Detail & Related papers (2021-03-29T01:33:34Z)
- An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models
We implement k-means quantization and compare its performance on the fixed-precision quantization of BERT with linear quantization.
We also compare the two quantization schemes on ALBERT models to explore the robustness differences between different pre-trained models.
arXiv Detail & Related papers (2020-10-14T14:05:06Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
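Several of the entries above contrast different quantization schemes; in particular, "An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models" compares k-means quantization with linear (uniform) quantization. The toy sketch below illustrates that comparison on a random weight matrix, assuming PyTorch; the helper names and the 3-bit setting are illustrative choices, not taken from that paper.

```python
import torch

def uniform_quantize(w, num_bits=3):
    """Linear (uniform) quantization: evenly spaced levels over the weight range."""
    levels = 2 ** num_bits
    scale = (w.max() - w.min()).clamp(min=1e-8) / (levels - 1)
    return torch.round((w - w.min()) / scale) * scale + w.min()

def kmeans_quantize(w, num_bits=3, iters=20):
    """k-means quantization: levels placed where the weight density is highest."""
    k = 2 ** num_bits
    flat = w.flatten()
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = torch.quantile(flat, torch.linspace(0, 1, k))
    for _ in range(iters):
        assign = (flat[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = flat[assign == j].mean()
    return centroids[assign].reshape(w.shape)

w = torch.randn(256, 256)  # stand-in for a pre-trained weight matrix
for name, w_q in [("uniform", uniform_quantize(w)), ("k-means", kmeans_quantize(w))]:
    print(f"{name:8s} MSE: {(w - w_q).pow(2).mean().item():.6f}")
```

On roughly Gaussian weights, k-means levels track the weight density and usually give a lower reconstruction error than evenly spaced levels, which is the kind of trade-off that paper examines on BERT and ALBERT.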
This list is automatically generated from the titles and abstracts of the papers on this site.