UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation
- URL: http://arxiv.org/abs/2602.09130v2
- Date: Wed, 11 Feb 2026 09:09:33 GMT
- Title: UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation
- Authors: Jonathan von Rad, Yong Cao, Andreas Geiger
- Abstract summary: We introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias, where knowledge-intensive tasks are relatively preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration gains at high computational cost; and (iii) task-specific calibration can significantly improve the reasoning ability of pruned models by up to 50%.
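To ground the compared families, here is a minimal sketch of the simplest representative of one of them, unstructured magnitude pruning of a weight matrix. The function name, the global-threshold rule, and the `sparsity` parameter are illustrative assumptions, not UniComp's actual pipeline.

```python
# Minimal sketch of unstructured magnitude pruning (illustrative only;
# not the UniComp pipeline). Weights with the smallest absolute value
# are zeroed until the requested sparsity is reached.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with smallest |w|."""
    assert 0.0 <= sparsity < 1.0
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # Threshold = k-th smallest absolute value across the whole tensor.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

# Example: prune a random 4096x4096 projection to 50% sparsity.
w = torch.randn(4096, 4096)
w_pruned = magnitude_prune(w, sparsity=0.5)
print(f"sparsity: {(w_pruned == 0).float().mean():.2%}")
```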
Related papers
- KARL: Knowledge Agents via Reinforcement Learning
We present a system for training enterprise search agents via reinforcement learning. KARLBench is a multi-capability evaluation suite spanning six distinct search regimes. We show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark.
arXiv Detail & Related papers (2026-03-05T14:30:25Z)
- Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs). Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios. We propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy.
arXiv Detail & Related papers (2026-02-09T06:57:15Z)
- A Systematic Study of Compression Ordering for Large Language Models
This study systematically examines how knowledge distillation, structured pruning, and low-bit quantization perform when applied to the Qwen2.5 3B model. Experiments show that quantization provides the greatest standalone compression, while pruning introduces moderate quality degradation.
arXiv Detail & Related papers (2025-11-23T12:46:56Z)
- Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression
We introduce information capacity, a measure of model efficiency based on text compression performance. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts.
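The abstract does not state the exact metric; one plausible operationalization of "text compression performance" is bits per byte under the model's next-token distribution, which naturally folds in tokenizer efficiency because a tokenizer that emits fewer tokens per byte leaves fewer log-probability terms to sum. A hedged sketch, assuming Hugging Face-style `model` and `tokenizer` objects:

```python
# Hedged sketch: bits-per-byte of a text under a causal LM, one plausible
# way to operationalize "efficiency via text compression". The paper's
# exact metric may differ; `model` and `tokenizer` are assumed to be
# Hugging Face-style objects (e.g. transformers.AutoModelForCausalLM).
import math
import torch

def bits_per_byte(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood in nats per predicted token.
    n_predicted = ids.shape[1] - 1
    total_nats = out.loss.item() * n_predicted
    n_bytes = len(text.encode("utf-8"))
    # Convert nats -> bits and normalize by raw byte count, so tokenizer
    # efficiency (tokens per byte) is reflected in the score.
    return total_nats / math.log(2) / n_bytes
```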
arXiv Detail & Related papers (2025-11-11T10:07:32Z)
- EfficientLLM: Efficiency in Large Language Models
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
- Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models
This study proposes a knowledge distillation algorithm based on large language models and feature alignment. The proposed model achieves performance very close to the state-of-the-art GPT-4 model on evaluation metrics such as perplexity, BLEU, ROUGE, and CER.
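The abstract names feature alignment without giving the loss; a common formulation, offered here only as a plausible reading rather than the paper's actual method, combines a temperature-scaled KL term on output logits with an MSE term aligning student and teacher hidden states through a learned projection:

```python
# Hedged sketch of feature-alignment distillation: KL divergence on logits
# plus MSE between projected student and teacher hidden states. The paper's
# exact loss and layer mapping are not given; this is a common formulation.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits,
                 student_hidden, teacher_hidden,
                 proj: torch.nn.Linear, T: float = 2.0, alpha: float = 0.5):
    # Soft-label KL between temperature-scaled distributions
    # (scaled by T^2 to keep gradient magnitudes comparable).
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Feature alignment: project student hidden states into the teacher's
    # width, then penalize the squared distance.
    feat = F.mse_loss(proj(student_hidden), teacher_hidden)
    return alpha * kl + (1 - alpha) * feat
```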
arXiv Detail & Related papers (2024-12-27T04:37:06Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit
Quantization, a key compression technique, can effectively mitigate the heavy computational resource demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
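For concreteness, the simplest weight-quantization baseline such a toolkit covers is round-to-nearest (RTN) with per-channel scales; the sketch below is generic and assumes nothing about LLMC's actual API.

```python
# Hedged sketch of round-to-nearest (RTN) weight quantization to 4 bits
# with per-output-channel scales; a generic baseline, not LLMC's own API.
import torch

def rtn_quantize(weight: torch.Tensor, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for symmetric INT4
    # One scale per output channel (row), from that row's max magnitude.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale      # 4-bit values stored in int8 here

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Example: quantize a random projection and inspect the reconstruction error.
w = torch.randn(4096, 4096)
q, s = rtn_quantize(w)
print(f"max abs error: {(w - dequantize(q, s)).abs().max():.4f}")
```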
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
This study conducts the first thorough evaluation of the trustworthiness of leading Large Language Models (LLMs) under compression.
We find that quantization is currently a more effective approach than pruning in achieving efficiency and trustworthiness simultaneously.
arXiv Detail & Related papers (2024-03-18T01:38:19Z)
- L3 Ensembles: Lifelong Learning Approach for Ensemble of Foundational Language Models
We propose an approach that focuses on extracting meaningful representations from unseen data and constructing a structured knowledge base.
We conducted experiments on various NLP tasks, including benchmarks such as GLUE and SuperGLUE, to validate its effectiveness.
The proposed L3 ensemble method increases model accuracy by 4% to 36% compared to the fine-tuned FLM.
arXiv Detail & Related papers (2023-11-11T06:59:50Z)