LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit
- URL: http://arxiv.org/abs/2405.06001v3
- Date: Wed, 09 Oct 2024 06:09:41 GMT
- Title: LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit
- Authors: Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Chengtao Lv, Yunchen Zhang, Xianglong Liu, Dacheng Tao
- Abstract summary: Quantization, a key compression technique, can effectively mitigate the computational and memory demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
- Score: 55.73370804397226
- License:
- Abstract: Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence with their remarkable emergent abilities and reasoning capabilities. However, their substantial computational and memory requirements limit their widespread adoption. Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating LLMs, albeit with potential risks to accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, their quantization configurations differ from one another, so they cannot be fairly compared. In this paper, we present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. LLMC integrates dozens of algorithms, models, and hardware backends, offering high extensibility from integer to floating-point quantization, from LLMs to vision-language models (VLMs), from fixed-bit to mixed precision, and from quantization to sparsification. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats, providing novel insights and detailed analyses for further research and practical guidance for users. Our toolkit is available at https://github.com/ModelTC/llmc.
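To make the accuracy trade-off mentioned in the abstract concrete, here is a minimal sketch of symmetric round-to-nearest integer quantization in plain Python. It is not LLMC's API or any method from the paper; the function names (`quantize_symmetric`, `dequantize`, `max_abs_error`) and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integer codes with one shared scale (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(weights, num_bits):
    """Largest absolute difference between original and restored weights."""
    codes, scale = quantize_symmetric(weights, num_bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical sample weights, for illustration only.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
err8 = max_abs_error(weights, 8)
err4 = max_abs_error(weights, 4)
print(f"8-bit max error: {err8:.5f}, 4-bit max error: {err4:.5f}")
```

With fewer bits the shared scale gets coarser, so the rounding error per weight grows. That is the accuracy risk that quantization algorithms try to reduce.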
Related papers
- A Comprehensive Study on Quantization Techniques for Large Language Models [0.0]
Large Language Models (LLMs) have been extensively researched and used in both academia and industry.
LLMs present significant challenges for deployment on resource-constrained IoT devices and embedded systems.
Quantization, a technique that reduces the precision of model values to a smaller set of discrete values, offers a promising solution.
arXiv Detail & Related papers (2024-10-30T04:55:26Z)
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
- Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox [46.39670209441478]
Large language models (LLMs) have exhibited exciting progress in multiple scenarios.
As an effective means to reduce memory footprint and inference cost, quantization also faces challenges in performance degradation at low bit-widths.
This work provides a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox.
arXiv Detail & Related papers (2024-06-15T12:02:14Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference and smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits.
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
- QuantEase: Optimization-based Quantization for Language Models [17.333778751252392]
Drawing on recent advances, this work focuses on Post-Training Quantization (PTQ) of Large Language Models (LLMs) and introduces QuantEase, a layer-wise quantization framework in which individual layers are quantized separately.
Our coordinate descent (CD)-based approach features straightforward updates, relying solely on vector operations.
We also explore an outlier-aware approach that retains significant weights (outliers) in full precision.
arXiv Detail & Related papers (2023-09-05T01:39:09Z)
- An Empirical Study of Low Precision Quantization for TinyML [8.939851623894334]
We focus on post-training quantization (PTQ) algorithms that quantize a model to low-bit (less than 8-bit) precision with only a small set of calibration data.
To achieve a fair comparison, we build a simulated quantization framework to investigate recent PTQ algorithms.
Through an ablation study of alternative components in the pipeline, we reveal key design choices for low-precision quantization.
arXiv Detail & Related papers (2022-03-10T17:22:08Z)
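Several entries above mention keeping a few significant weights (outliers) in full precision while quantizing the rest to low bit-widths. The sketch below illustrates that general idea in plain Python under simple assumptions: a shared symmetric scale and outliers chosen by magnitude. It is not the algorithm of QuantEase, SliM-LLM, or any other listed paper, and the names `quantize_with_outliers` and `mean_abs_error` are hypothetical.

```python
def quantize_with_outliers(weights, num_bits=4, num_outliers=1):
    """Quantize-dequantize weights, leaving the largest-magnitude ones untouched."""
    qmax = 2 ** (num_bits - 1) - 1
    # Indices of the largest-magnitude weights; these stay in full precision.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outliers = set(order[:num_outliers])
    # The scale is derived from the remaining (inlier) weights only.
    max_abs = max((abs(w) for i, w in enumerate(weights) if i not in outliers),
                  default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for i, w in enumerate(weights):
        if i in outliers:
            restored.append(w)  # kept exactly as stored
        else:
            code = max(-qmax, min(qmax, round(w / scale)))
            restored.append(code * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute reconstruction error."""
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)


# Hypothetical weights: five small values and one large outlier.
weights = [0.02, -0.05, 0.03, 0.04, -0.01, 6.0]
plain = quantize_with_outliers(weights, num_bits=4, num_outliers=0)
kept = quantize_with_outliers(weights, num_bits=4, num_outliers=1)
print(mean_abs_error(weights, plain), mean_abs_error(weights, kept))
```

When the single large weight sets the scale, every small weight rounds to zero. Excluding it from the scale lets the 4-bit grid cover the small weights, at the cost of storing one value separately.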