Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
- URL: http://arxiv.org/abs/2508.16712v1
- Date: Fri, 22 Aug 2025 14:59:23 GMT
- Title: Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
- Authors: Tianyao Shi, Yi Ding
- Abstract summary: Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. Their heavy resource demands make quantization (reducing precision to lower-bit formats) critical for efficient serving. We first develop a fully automated online characterization framework, qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods.
- Score: 5.1094466593178325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization (reducing precision to lower-bit formats) critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs in realistic serving conditions remains a gap. In this work, we first develop a fully automated online characterization framework, qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across four model sizes (7B-70B) and two GPU architectures (A100, H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with parallelism and GPU architecture. We further present three optimization case studies illustrating deployment challenges in capacity planning, energy-efficient scheduling, and multi-objective tuning. To the best of our knowledge, this is one of the first comprehensive application-, system-, and hardware-level characterizations of LLM quantization from a joint performance, energy, and quality perspective.
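The abstract does not detail how qMeter collects its measurements, but the kind of joint performance and energy characterization it describes can be made concrete with a small sketch. The following is illustrative only, not the authors' framework: it assumes NVML-based power polling (via the pynvml bindings) and a hypothetical `generate` function standing in for whatever call the serving engine exposes.

```python
# Illustrative sketch only: NOT the authors' qMeter framework.
# Shows one way to jointly sample per-request latency, throughput, and GPU
# energy for a locally served quantized model. `generate` is a hypothetical
# stand-in for the serving engine's inference call.
import time
import threading
import pynvml  # NVIDIA NVML Python bindings


def sample_power(handle, samples, stop_event, interval_s=0.05):
    """Poll GPU power draw (watts) until stop_event is set."""
    while not stop_event.is_set():
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
        time.sleep(interval_s)


def measure_request(generate, prompt, gpu_index=0):
    """Run one request and return latency, throughput, and an energy estimate (J)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    samples, stop = [], threading.Event()
    sampler = threading.Thread(target=sample_power, args=(handle, samples, stop))
    sampler.start()

    t0 = time.perf_counter()
    output_tokens = generate(prompt)  # hypothetical serving-engine call
    latency = time.perf_counter() - t0

    stop.set()
    sampler.join()
    pynvml.nvmlShutdown()

    avg_power = sum(samples) / max(len(samples), 1)
    return {
        "latency_s": latency,
        "tokens_per_s": len(output_tokens) / latency,
        "energy_j": avg_power * latency,  # coarse energy estimate
    }
```

Deriving energy as average power times latency is a coarse approximation; a full characterization such as the one described above would additionally record per-task quality metrics and repeat measurements across workloads, parallelism configurations, and GPU architectures.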
Related papers
- Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
- Is Quantization a Deal-breaker? Empirical Insights from Large Code Models [7.182449176083625]
We apply Activation-aware Weight Quantization (AWQ) to two widely used code models, CodeLlama and DeepSeekCoder, to generate Java and Python code. Our findings reveal that quantization is a robust technique that not only preserves functional correctness but also retains key qualitative code attributes sought after by developers.
arXiv Detail & Related papers (2025-07-13T14:58:19Z)
- Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency [6.306413686006502]
We conduct a comprehensive analysis of 28 quantized Large Language Models (LLMs) from the Ollama library. We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings.
arXiv Detail & Related papers (2025-04-04T11:29:30Z)
- Large Language Model as Meta-Surrogate for Data-Driven Many-Task Optimization: A Proof-of-Principle Study [11.452011929848844]
This study proposes a novel meta-surrogate framework to assist many-task optimization. We formulate a unified framework for many-task fitness prediction by defining a universal model with metadata to fit a group of problems. Our framework supports dual-level knowledge transfer, at both the surrogate and individual levels, enhancing optimization efficiency and robustness.
arXiv Detail & Related papers (2025-03-11T11:13:11Z)
- Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications. The dataset includes a diverse set of multi-hop questions, covering true/false and multiple-choice types, spanning difficulty levels from easy to hard. We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
- A Survey on Inference Optimization Techniques for Mixture of Experts Models [50.40325411764262]
Large-scale Mixture of Experts (MoE) models offer enhanced model capacity and computational efficiency through conditional computation. However, deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency. This survey analyzes optimization techniques for MoE models across the entire system stack.
arXiv Detail & Related papers (2024-12-18T14:11:15Z)
- Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview [4.166341398835636]
We discuss the necessity and impact of model size growth, highlighting the performance benefits as well as the computational challenges and environmental considerations.
We delve into various quantization techniques, including both post-training quantization (PTQ) and quantization-aware training (QAT).
We examine how these methods address issues like outliers, importance weighting, and activation quantization, ultimately contributing to more sustainable and accessible deployment of large-scale models (a minimal PTQ sketch appears after this list).
arXiv Detail & Related papers (2024-09-18T02:35:00Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Quantum Computing Enhanced Service Ecosystem for Simulation in Manufacturing [56.61654656648898]
We propose a framework for a quantum computing-enhanced service ecosystem for simulation in manufacturing.
We analyse two high-value use cases with the aim of a quantitative evaluation of these new computing paradigms for industrially-relevant settings.
arXiv Detail & Related papers (2024-01-19T11:04:14Z)
- QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources [35.16907522675046]
Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides significant performance gains, but this process typically requires a large number of expensive, high-end GPUs. We propose QFT, a Quantized Full-parameter Tuning framework that quantizes and stores all training states.
arXiv Detail & Related papers (2023-10-11T02:47:40Z)
- Potential and limitations of quantum extreme learning machines [55.41644538483948]
We present a framework to model quantum reservoir computers (QRCs) and quantum extreme learning machines (QELMs), showing that they can be concisely described via single effective measurements.
Our analysis paves the way to a more thorough understanding of the capabilities and limitations of both QELMs and QRCs.
arXiv Detail & Related papers (2022-10-03T09:32:28Z)
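As background for the post-training quantization (PTQ) methods discussed in the entries above (e.g., the "Art and Science of Quantizing Large-Scale Models" overview), the following sketch shows a deliberately naive symmetric per-tensor int8 weight quantizer. It is purely illustrative, assumes PyTorch, and does not correspond to any specific method from the listed papers.

```python
# Purely illustrative: minimal symmetric per-tensor int8 post-training weight
# quantization. Not taken from any paper listed above; shown only to make the
# PTQ idea concrete. Assumes PyTorch is installed.
import torch


def quantize_int8(w: torch.Tensor):
    """Quantize a float weight tensor to int8 with a single symmetric scale."""
    scale = w.abs().max() / 127.0  # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor for use at inference time."""
    return q.to(torch.float32) * scale


w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())
```

Practical PTQ methods such as AWQ or GPTQ improve on this naive per-tensor scheme precisely by handling outliers and weighting salient channels, which is the gap highlighted in the survey entry above.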