Do Emergent Abilities Exist in Quantized Large Language Models: An
Empirical Study
- URL: http://arxiv.org/abs/2307.08072v2
- Date: Wed, 26 Jul 2023 04:15:48 GMT
- Title: Do Emergent Abilities Exist in Quantized Large Language Models: An
Empirical Study
- Authors: Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang
Li, Bolin Ding, Ji-Rong Wen
- Abstract summary: This work aims to investigate the impact of quantization on emphemergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-gained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
- Score: 90.34226812493083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the superior performance, Large Language Models~(LLMs) require
significant computational resources for deployment and use. To overcome this
issue, quantization methods have been widely applied to reduce the memory
footprint of LLMs as well as increasing the inference rate. However, a major
challenge is that low-bit quantization methods often lead to performance
degradation. It is important to understand how quantization impacts the
capacity of LLMs. Different from previous studies focused on overall
performance, this work aims to investigate the impact of quantization on
\emph{emergent abilities}, which are important characteristics that distinguish
LLMs from small language models. Specially, we examine the abilities of
in-context learning, chain-of-thought reasoning, and instruction-following in
quantized LLMs. Our empirical experiments show that these emergent abilities
still exist in 4-bit quantization models, while 2-bit models encounter severe
performance degradation on the test of these abilities. To improve the
performance of low-bit models, we conduct two special experiments: (1)
fine-gained impact analysis that studies which components (or substructures)
are more sensitive to quantization, and (2) performance compensation through
model fine-tuning. Our work derives a series of important findings to
understand the impact of quantization on emergent abilities, and sheds lights
on the possibilities of extremely low-bit quantization for LLMs.
Related papers
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [17.43650511873449]
Large Language Models (LLMs) showcase remarkable performance and robust deductive capabilities, yet their expansive size complicates deployment and raises environmental concerns due to substantial resource consumption.
We have developed innovative methods that enhance the performance of quantized LLMs, particularly in low-bit settings.
Our methods consistently deliver state-of-the-art results across various quantization scenarios and offer deep theoretical insights into the quantization process, elucidating the potential of quantized models for widespread application.
arXiv Detail & Related papers (2024-07-22T09:45:16Z) - How Does Quantization Affect Multilingual LLMs? [50.867324914368524]
Quantization techniques are widely used to improve inference speed and deployment of large language models.
We conduct a thorough analysis of quantized multilingual LLMs, focusing on their performance across languages and at varying scales.
arXiv Detail & Related papers (2024-07-03T15:39:40Z) - When Quantization Affects Confidence of Large Language Models? [4.338589334157708]
We show that GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models.
We propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
arXiv Detail & Related papers (2024-05-01T16:58:28Z) - What Makes Quantization for Large Language Models Hard? An Empirical
Study from the Lens of Perturbation [55.153595212571375]
Quantization is a technique for improving the memory and computational efficiency of large language models (LLMs)
We propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs.
We conduct experiments with various artificial perturbations to explore their impact on LLM performance.
arXiv Detail & Related papers (2024-03-11T03:42:51Z) - A Comprehensive Evaluation of Quantization Strategies for Large Language Models [42.03804933928227]
Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs.
Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular.
We propose a structured evaluation framework consisting of three critical dimensions: knowledge & capacity, (2) alignment, and (3) efficiency.
arXiv Detail & Related papers (2024-02-26T17:45:36Z) - ApiQ: Finetuning of 2-Bit Quantized Large Language Model [12.328293460903911]
ApiQ is designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs.
It consistently achieves superior finetuning results across various bit-widths.
arXiv Detail & Related papers (2024-02-07T09:36:54Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning [105.77733287326308]
We evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following.
We explore the training-free in-context learning (ICL) as a solution, and study how it affects these limitations.
Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as; Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
arXiv Detail & Related papers (2023-10-01T12:02:59Z) - Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training.
We show that the resulting problem is strongly dual and does away with gradient estimations.
We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
arXiv Detail & Related papers (2022-10-27T17:12:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.