Related papers: A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

Related papers

SiLQ: Simple Large Language Model Quantization-Aware Training [3.09578981466695]
Large language models can be quantized to reduce inference time latency, model size, and energy consumption.<n>A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time.<n>Here, we demonstrate a simple, end-to-end quantization-aware training approach that outperforms the leading published quantization methods.
arXiv Detail & Related papers (2025-07-22T18:17:53Z)
InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models [39.257022875813284]
Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME.<n>Model quantization has emerged as a promising approach to reduce memory footprint and inference latency.<n>We show that quantization can degrade mathematical reasoning accuracy by up to 69.81%.
arXiv Detail & Related papers (2025-05-16T12:11:40Z)
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [135.1260782461186]
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) However, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. We present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward.
arXiv Detail & Related papers (2025-04-30T00:04:35Z)
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models [65.39456695678713]
We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists. We find that in general, reasoning models are poorly calibrated, particularly on easy problems. We introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.
arXiv Detail & Related papers (2025-04-17T22:16:30Z)
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset. We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard) We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT-1K instances. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models [48.98109982725689]
We conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths. We identify model size, model origin, and task difficulty as critical determinants of performance.
arXiv Detail & Related papers (2025-04-07T08:22:45Z)
Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs)
arXiv Detail & Related papers (2025-01-28T18:04:50Z)
Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning [29.687113675756127]
Large language models have achieved significant advancements in complex mathematical reasoning benchmarks, such as MATH. Model quantization has emerged as an effective strategy to reduce memory usage and computational costs by employing lower precision and bit-width representations.
arXiv Detail & Related papers (2025-01-06T14:23:02Z)
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating large language models. It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities. We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox [46.39670209441478]
Large language models (LLMs) have exhibited exciting progress in multiple scenarios. As an effective means to reduce memory footprint and inference cost, quantization also faces challenges in performance degradation at low bit-widths. This work provides a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox.
arXiv Detail & Related papers (2024-06-15T12:02:14Z)
The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs [2.6968321526169503]
Post-training quantization reduces the computational demand of Large Language Models (LLMs) but can weaken some of their capabilities. This paper explores how quantization affects smaller LLMs' ability to perform retrieval-augmented generation (RAG) Our findings reveal that if a 7B LLM performs the task well, quantization does not impair its performance and long-context reasoning capabilities.
arXiv Detail & Related papers (2024-06-10T08:23:52Z)
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs) Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
Quantifying the Capabilities of LLMs across Scale and Precision [12.879551933541345]
This study investigates the effect of model scale and quantization on the performance of instruct models. We show that larger models show exceptional resilience to precision reduction and can maintain high accuracy even at 4-bit quantization.
arXiv Detail & Related papers (2024-05-06T03:42:34Z)
An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs [54.91212829143966]
This study explores LLaMA3's capabilities when quantized to low bit-width. We evaluate 10 existing post-training quantization and LoRA-finetuning methods of LLaMA3 on 1-8 bits and diverse datasets. Our experimental results indicate that LLaMA3 still suffers non-negligent degradation in linguistic and visual contexts.
arXiv Detail & Related papers (2024-04-22T10:03:03Z)
A Comprehensive Evaluation of Quantization Strategies for Large Language Models [42.03804933928227]
Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular. We propose a structured evaluation framework consisting of three critical dimensions: knowledge & capacity, (2) alignment, and (3) efficiency.
arXiv Detail & Related papers (2024-02-26T17:45:36Z)
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models [0.18416014644193066]
AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization. We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model.
arXiv Detail & Related papers (2023-11-02T15:18:22Z)
BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS) We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks. However, the massive size of these models poses huge challenges for their deployment in real-world applications. We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emphemergent abilities, which are important characteristics that distinguish LLMs from small language models. Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation. To improve the performance of low-bit models, we conduct two special experiments: (1) fine-gained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation. We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
Zero-shot Adversarial Quantization [11.722728148523366]
We propose a zero-shot adversarial quantization (ZAQ) framework, facilitating effective discrepancy estimation and knowledge transfer. This is achieved by a novel two-level discrepancy modeling to drive a generator to synthesize informative and diverse data examples. We conduct extensive experiments on three fundamental vision tasks, demonstrating the superiority of ZAQ over the strong zero-shot baselines.
arXiv Detail & Related papers (2021-03-29T01:33:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.