A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
- URL: http://arxiv.org/abs/2409.11055v1
- Date: Tue, 17 Sep 2024 10:31:37 GMT
- Title: A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
- Authors: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon
- Abstract summary: This paper evaluates the performance of quantized instruction-tuned LLMs ranging from 7B to 405B parameters.
We assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue.
- Score: 11.832907585157638
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Prior research has evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks and old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B parameters have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
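Finding (1) rests on a simple memory-footprint equivalence: a large model quantized to a low bit-width occupies roughly the same space as a much smaller FP16 model. The back-of-the-envelope sketch below is not from the paper; the model names, sizes, and bit-widths are illustrative assumptions, and scale/zero-point, KV-cache, and activation overheads are ignored. It only makes the size arithmetic behind that comparison explicit.

```python
# Back-of-the-envelope weight-memory comparison (illustrative only; actual
# footprints depend on the quantization format's scale/zero-point overhead,
# the KV cache, and activations, none of which are modeled here).

def weight_memory_gb(num_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a dense model."""
    bytes_total = num_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

configs = [
    ("Llama-3.1-8B, FP16", 8, 16),
    ("Llama-3.1-70B, FP16", 70, 16),
    ("Llama-3.1-70B, 4-bit weight-only", 70, 4),
    ("Llama-3.1-405B, FP16", 405, 16),
    ("Llama-3.1-405B, 4-bit weight-only", 405, 4),
]

for name, params_b, bits in configs:
    print(f"{name:<35} ~{weight_memory_gb(params_b, bits):6.1f} GB")

# A 70B model quantized to 4 bits (~35 GB) lands in the same memory range as a
# ~16B FP16 model (~32 GB), which is the comparison behind finding (1) above.
```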
Related papers
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [69.57918638435491]
Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models.
This paper focuses on a core question: what is the optimal approach to scaling test-time computation across different policy models, PRMs (process reward models), and problem difficulty levels?
We show that with our compute-optimal TTS strategy, extremely small policy models can outperform larger models.
arXiv Detail & Related papers (2025-02-10T17:30:23Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.
GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating uncertainty in large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the strongest performer.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox [46.39670209441478]
Large language models (LLMs) have exhibited exciting progress in multiple scenarios.
While quantization is an effective means of reducing memory footprint and inference cost, it faces challenges from performance degradation at low bit-widths.
This work provides a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox.
arXiv Detail & Related papers (2024-06-15T12:02:14Z)
- The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs [2.6968321526169503]
Post-training quantization reduces the computational demand of Large Language Models (LLMs) but can weaken some of their capabilities.
This paper explores how quantization affects smaller LLMs' ability to perform retrieval-augmented generation (RAG).
Our findings reveal that if a 7B LLM performs the task well, quantization does not impair its performance or its long-context reasoning capabilities.
arXiv Detail & Related papers (2024-06-10T08:23:52Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4 bits.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- Quantifying the Capabilities of LLMs across Scale and Precision [12.879551933541345]
This study investigates the effect of model scale and quantization on the performance of instruct models.
We show that larger models exhibit exceptional resilience to precision reduction and can maintain high accuracy even at 4-bit quantization.
arXiv Detail & Related papers (2024-05-06T03:42:34Z)
- An empirical study of LLaMA3 quantization: from LLMs to MLLMs [54.91212829143966]
The LLaMA family is one of the most powerful open-source large language models (LLMs).
LLaMA3 models have achieved impressive performance in various domains through large-scale pre-training on over 15T tokens of data.
We evaluate 10 existing post-training quantization and LoRA fine-tuning (LoRA-FT) methods on LLaMA3 at 1-8 bits and across various datasets to reveal its low-bit quantization performance.
arXiv Detail & Related papers (2024-04-22T10:03:03Z)
- A Comprehensive Evaluation of Quantization Strategies for Large Language Models [42.03804933928227]
Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs.
Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular; a minimal round-to-nearest sketch of this idea appears after this list.
We propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency.
arXiv Detail & Related papers (2024-02-26T17:45:36Z)
- BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
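Several of the entries above study post-training weight quantization at 4 bits and below. As a point of reference, the following NumPy sketch implements plain round-to-nearest (RTN) per-group symmetric quantization; it is an illustrative baseline under assumed settings (4 bits, group size 128, a synthetic weight slice), not a reproduction of GPTQ, AWQ, SmoothQuant, SliM-LLM, or any other method cited here.

```python
# Minimal round-to-nearest (RTN) weight quantization sketch in NumPy.
# Generic baseline for intuition only; real PTQ methods (GPTQ, AWQ,
# SmoothQuant, SliM-LLM, ...) add error compensation, activation-aware
# scaling, or mixed-precision logic on top of this idea.
import numpy as np

def quantize_rtn(weights: np.ndarray, bits: int = 4, group_size: int = 128):
    """Symmetric per-group quantization of a 1-D weight slice."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    w = weights.reshape(-1, group_size)             # one scale per group
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map integer codes back to floating point."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)   # toy weight slice
q, s = quantize_rtn(w, bits=4)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean absolute quantization error: {err:.6f}")
```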