Related papers: Scaling laws for post-training quantized large language models

Scaling laws for post-training quantized large language models

URL: http://arxiv.org/abs/2410.12119v2
Date: Thu, 17 Oct 2024 22:40:13 GMT
Title: Scaling laws for post-training quantized large language models
Authors: Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang,
Abstract summary: Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. The quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice.
Score: 41.78467383320145
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.

Related papers

RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE) RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Scaling Law for Language Models Training Considering Batch Size [17.09348741898811]
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. We empirically investigate how a critical hyper- parameter, i.e., the global batch size, influences the LLM training prdocess. We establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models.
arXiv Detail & Related papers (2024-12-02T13:58:35Z)
A Comprehensive Study on Quantization Techniques for Large Language Models [0.0]
Large Language Models (LLMs) have been extensively researched and used in both academia and industry. LLMs present significant challenges for deployment on resource-constrained IoT devices and embedded systems. Quantization, a technique that reduces the precision of model values to a smaller set of discrete values, offers a promising solution.
arXiv Detail & Related papers (2024-10-30T04:55:26Z)
Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment.<n>Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
A Comprehensive Evaluation of Quantization Strategies for Large Language Models [42.03804933928227]
Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular. We propose a structured evaluation framework consisting of three critical dimensions: knowledge & capacity, (2) alignment, and (3) efficiency.
arXiv Detail & Related papers (2024-02-26T17:45:36Z)
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks. However, the massive size of these models poses huge challenges for their deployment in real-world applications. We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
arXiv Detail & Related papers (2023-09-29T14:35:27Z)
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emphemergent abilities, which are important characteristics that distinguish LLMs from small language models. Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation. To improve the performance of low-bit models, we conduct two special experiments: (1) fine-gained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel quantize before fine-tuning'' framework, PreQuant. PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error. We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models [38.76165207636793]
We propose a data-free distillation method that leverages generations produced by the pre-trained model. In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput. We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4-bits.
arXiv Detail & Related papers (2023-05-29T05:22:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.