Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis
- URL: http://arxiv.org/abs/2502.13178v3
- Date: Sun, 30 Mar 2025 06:18:35 GMT
- Title: Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis
- Authors: Jiaqi Zhao, Ming Wang, Miao Zhang, Yuzhang Shang, Xuebo Liu, Yaowei Wang, Min Zhang, Liqiang Nie
- Abstract summary: The post-training quantization (PTQ) technique has been extensively adopted for large language model (LLM) compression. Existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth. We provide a novel benchmark for LLM PTQ in this paper.
- Score: 89.60263788590893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The post-training quantization (PTQ) technique has been extensively adopted for compressing large language models (LLMs) owing to its efficiency and low resource requirements. However, current research lacks an in-depth analysis of the strengths and applicable scenarios of each PTQ strategy. In addition, existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth. To address these gaps, we provide a novel benchmark for LLM PTQ in this paper. Firstly, to support our benchmark, we propose a comprehensive taxonomy of existing mainstream methods by scrutinizing their computational strategies (e.g., optimization-based, compensation-based, etc.). Then, we conduct extensive experiments with the baseline of each class, covering models of various sizes (7B-70B), bitwidths, training levels (LLaMA1/2/3/3.1), architectures (Mixtral, DeepSeekMoE and Mamba), and modalities (LLaVA1.5 and VILA1.5) on a wide range of evaluation metrics. Through comparative analysis of the results, we summarize the strengths of each PTQ strategy and the model-size-bitwidth trade-off with respect to performance. For example, our benchmark reveals that the compensation-based technique demonstrates outstanding cross-architecture robustness, and that extremely low-bit PTQ for ultra-large models should be re-examined. Finally, we accordingly claim that a practical combination of compensation-based and other PTQ strategies can achieve state-of-the-art robustness across settings. We believe that our benchmark will provide valuable recommendations for the deployment of LLMs and future research on PTQ approaches. We maintain a repository for our benchmark at https://github.com/zjq0455/PTQ_Benchmark.
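To make the benchmark setting concrete, below is a minimal sketch of the round-to-nearest (RTN) weight-only quantizer that PTQ studies commonly use as a reference point. The function name, group size, and bitwidth are illustrative assumptions, not the paper's implementation; the taxonomy's optimization-based and compensation-based methods differ mainly in how they choose scales and correct rounding error, but operate on the same per-layer weights.

```python
import torch

def rtn_quantize_weight(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest, per-group asymmetric weight quantization (illustrative baseline).

    Each group of `group_size` weights along the input dimension gets its own
    scale and zero-point; the function returns the dequantized weights that the
    model would actually use at inference time.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_g = w.reshape(out_features, in_features // group_size, group_size)

    w_min = w_g.amin(dim=-1, keepdim=True)
    w_max = w_g.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1

    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(w_g / scale) + zero_point, 0, qmax)
    return ((q - zero_point) * scale).reshape(out_features, in_features)

# Example: 4-bit quantization of one linear layer's weight and the resulting error.
w = torch.randn(4096, 4096)
w_q = rtn_quantize_weight(w, n_bits=4)
print(f"mean absolute quantization error: {(w - w_q).abs().mean().item():.4f}")
```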
Related papers
- MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling [7.980524378201173]
Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs).
However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks.
We introduce MT-RewardTree, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT.
arXiv Detail & Related papers (2025-03-15T13:04:51Z) - RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining.
We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers.
We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
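As a rough illustration of the rotate-then-quantize idea summarized above (a generic toy, not RSQ's actual algorithm, which additionally learns scales from token importance), the sketch below applies a random orthogonal rotation to a weight matrix before quantizing it. The rotation spreads outlier channels, and its inverse can be folded into the adjacent layer so the network's function is unchanged; all names and settings are illustrative.

```python
import torch

def quantize_symmetric(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Per-output-channel symmetric round-to-nearest quantization.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def rotate_then_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix;
    # rotating mixes outlier columns into all columns before quantization.
    r, _ = torch.linalg.qr(torch.randn(w.shape[1], w.shape[1]))
    return quantize_symmetric(w @ r, n_bits) @ r.T   # rotate back for comparison

w = torch.randn(1024, 1024)
w[:, :8] *= 20                                       # simulate outlier channels
print("plain RTN error      :", (w - quantize_symmetric(w)).abs().mean().item())
print("rotate-then-quantize :", (w - rotate_then_quantize(w)).abs().mean().item())
```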
arXiv Detail & Related papers (2025-03-03T18:46:33Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [95.32315448601241]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
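For readers unfamiliar with the straight-through estimator (STE) that RoSTE's name refers to, here is a textbook sketch of STE-based weight quantization in PyTorch: the forward pass rounds the weights, while the backward pass lets gradients flow as if no rounding had occurred. This shows the generic mechanism only, not RoSTE's rotation-aware fine-tuning procedure; names and bitwidths are illustrative.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Straight-through estimator: quantize on the forward pass, identity gradient on the backward pass."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Treat the rounding as the identity: gradient w.r.t. w passes through
        # unchanged; n_bits receives no gradient.
        return grad_output, None

w = torch.randn(256, 256, requires_grad=True)
loss = STEQuantize.apply(w, 4).pow(2).sum()
loss.backward()   # w.grad is populated despite the non-differentiable rounding
```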
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models [13.709080134204326]
Post-Training Quantization (PTQ) is pivotal for deploying large language models within resource-limited settings. We introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a graph neural network (GNN) module to capture dependencies among weights. Our method more effectively captures dependencies among target weights, leading to a more accurate assessment of weight importance.
arXiv Detail & Related papers (2025-01-30T05:39:01Z) - Revisiting BPR: A Replicability Study of a Common Recommender System Baseline [78.00363373925758]
We study the features of the BPR model, indicating their impact on its performance, and investigate open-source BPR implementations.
Our analysis reveals inconsistencies between these implementations and the original BPR paper, leading to a significant decrease in performance of up to 50% for specific implementations.
We show that the BPR model can achieve performance levels close to state-of-the-art methods on the top-n recommendation tasks and even outperform them on specific datasets.
arXiv Detail & Related papers (2024-09-21T18:39:53Z) - PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression [31.30170080420504]
State-of-the-art quantization methods include fine-tuning (part of) the compressed parameters over a limited amount of calibration data.
We propose PV-Tuning - a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies.
arXiv Detail & Related papers (2024-05-23T17:57:04Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the computational and memory demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark [60.72725673114168]
We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets.
We propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark.
arXiv Detail & Related papers (2023-12-21T03:11:30Z) - A Model-Based Machine Learning Approach for Assessing the Performance of Blockchain Applications [0.0]
We use machine learning (ML) model-based methods to predict blockchain performance.
We employ the salp swarm optimization (SO) ML model, which enables the investigation of optimal blockchain configurations.
The $k$NN model outperforms SVM by 5%, and the ISO also demonstrates a 4% reduction in accuracy deviation compared to regular SO.
arXiv Detail & Related papers (2023-09-20T10:39:21Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Is One Epoch All You Need For Multi-Fidelity Hyperparameter Optimization? [17.21160278797221]
Multi-fidelity HPO (MF-HPO) leverages intermediate accuracy levels in the learning process and discards low-performing models early on.
We compared various representative MF-HPO methods against a simple baseline on classical benchmark data.
This baseline achieved similar results to its counterparts, while requiring an order of magnitude less computation.
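As a deliberately simplified picture of such a low-fidelity baseline, the sketch below scores every hyperparameter configuration after a single epoch and spends the full training budget only on the top few. `train_one_epoch` (returning a validation score, higher is better) and `full_train` are hypothetical callables standing in for the user's own training code.

```python
def one_epoch_baseline(configs, train_one_epoch, full_train, keep_top=3):
    """Screen configurations with one cheap epoch, then fully train the best ones."""
    screened = sorted(configs, key=train_one_epoch, reverse=True)  # cheap pass
    return [full_train(cfg) for cfg in screened[:keep_top]]        # expensive pass
```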
arXiv Detail & Related papers (2023-07-28T09:14:41Z) - Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique to significantly reduce the burden on Large Language Models (LLMs) when used as text rankers.
Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
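To illustrate the general shape of pairwise ranking prompting (the paper's exact prompt wording and aggregation strategy may differ), the sketch below asks a model to compare candidate passages two at a time and ranks them by their number of pairwise wins; `llm_choose` is a hypothetical stand-in for the actual LLM call.

```python
from itertools import combinations

def pairwise_prompt(query: str, passage_a: str, passage_b: str) -> str:
    # Illustrative template: present the query and two passages, ask for a choice.
    return (
        f'Given the query: "{query}"\n'
        f"Passage A: {passage_a}\n"
        f"Passage B: {passage_b}\n"
        "Which passage is more relevant to the query? Answer 'A' or 'B'."
    )

def rank_passages(query, passages, llm_choose):
    """llm_choose(prompt) -> 'A' or 'B'; rank passages by all-pairs wins."""
    wins = {i: 0 for i in range(len(passages))}
    for i, j in combinations(range(len(passages)), 2):
        answer = llm_choose(pairwise_prompt(query, passages[i], passages[j]))
        wins[i if answer.strip().upper().startswith("A") else j] += 1
    return sorted(range(len(passages)), key=lambda i: -wins[i])
```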
arXiv Detail & Related papers (2023-06-30T11:32:25Z) - An Empirical Study of Pre-trained Language Models in Simple Knowledge Graph Question Answering [28.31377197194905]
Large-scale pre-trained language models (PLMs) have recently achieved great success and become a milestone in natural language processing (NLP).
In recent works on knowledge graph question answering (KGQA), BERT or its variants have become necessary in their KGQA models.
We compare the performance of different PLMs in KGQA and present three benchmarks for larger-scale KGs.
arXiv Detail & Related papers (2023-03-18T08:57:09Z) - Generalized Parametric Contrastive Learning [60.62901294843829]
Generalized Parametric Contrastive Learning (GPaCo/PaCo) works well on both imbalanced and balanced data.
Experiments on long-tailed benchmarks manifest the new state-of-the-art for long-tailed recognition.
arXiv Detail & Related papers (2022-09-26T03:49:28Z)