SPQ: An Ensemble Technique for Large Language Model Compression
- URL: http://arxiv.org/abs/2602.18420v1
- Date: Fri, 20 Feb 2026 18:44:16 GMT
- Title: SPQ: An Ensemble Technique for Large Language Model Compression
- Authors: Jiamin Yao, Eren Gultepe
- Abstract summary: SPQ (SVD-Pruning-Quantization) is an ensemble technique for large language model (LLM) compression. It achieves up to 75% memory reduction while maintaining or improving perplexity. It improves inference over GPTQ, achieving up to a 1.9x speedup.
- Score: 1.2891210250935148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms the individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared with strong baselines such as GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. SPQ's layer-aware and complementary compression may enable practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
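The three components described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the variance threshold, and the per-tensor symmetric quantization scheme are illustrative assumptions based only on the abstract's description.

```python
import numpy as np

def svd_lowrank(W, var_keep=0.95):
    """Variance-retained truncated SVD: keep the smallest rank r whose
    squared singular values cover `var_keep` of the total variance."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(energy, var_keep)) + 1
    return U[:, :r] * s[:r], Vt[:r, :]  # W ~= A @ B

def prune_neurons(W, acts, keep_frac=0.5):
    """Activation-based pruning: drop the MLP neurons (rows of W) whose
    mean absolute activation over a calibration batch is smallest."""
    scores = np.abs(acts).mean(axis=0)
    k = max(1, int(keep_frac * W.shape[0]))
    keep = np.sort(np.argsort(scores)[-k:])
    return W[keep], keep

def quantize_int8(W):
    """Post-training linear quantization: symmetric per-tensor int8."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize as q * scale
```

In this sketch, `A @ B` would stand in for a low-rank attention projection, the pruned `W` for an MLP weight, and `(q, scale)` for any 8-bit linear layer; the ensemble idea is to apply each transform to the layer type where it loses the least accuracy.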
Related papers
- COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression [5.280540253822294]
Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). We propose COMPOT, a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines.
arXiv Detail & Related papers (2026-02-16T21:31:34Z)
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
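As a toy illustration of the bit-plane idea (not BPDQ itself, whose grid construction is not specified here): integer codes are split into binary planes, and values are reconstructed as a coefficient-weighted sum of those planes. Powers of two recover the ordinary binary grid, while other coefficients yield a variable grid.

```python
import numpy as np

def to_bitplanes(codes, nbits=4):
    """Split non-negative integer codes into `nbits` binary planes."""
    return np.stack([(codes >> k) & 1 for k in range(nbits)])

def from_bitplanes(planes, coeffs):
    """Reconstruct values as a coefficient-weighted sum of bit-planes."""
    return np.tensordot(coeffs, planes.astype(coeffs.dtype), axes=1)
```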
arXiv Detail & Related papers (2026-02-04T02:54:37Z) - HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression [0.0]
We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority.
arXiv Detail & Related papers (2026-01-11T15:35:10Z) - Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts. We first show that existing prompt compression methods are ineffective for many-shot compression. We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z) - IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs [4.655407920049974]
Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits. In this paper, we propose two innovations to address these limitations. First, we frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE). Second, we propose Interaction-aware Mixed-Precision Quantization (IMPQ), which translates these Shapley estimates into a binary quadratic optimization formulation.
arXiv Detail & Related papers (2025-09-18T21:59:40Z) - GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference [5.358880304650258]
Group Quantization and Sparse Acceleration (GQSA) is a novel compression technique tailored for LLMs. We propose a two-stage sparse optimization strategy that ensures the performance superiority of the compressed model. Experimental results demonstrate that, under the GQSA W4S50% compression setting, the model's accuracy surpasses that of both 2:4 pruning and W2 quantization.
arXiv Detail & Related papers (2024-12-23T13:28:15Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group-wise level. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - Memory-Efficient Fine-Tuning of Compressed Large Language Models via
sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA) - a simple yet effective method that combines the advantages of parameter-efficient fine-tuning (PEFT) with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z)
- CSMPQ: Class Separability Based Mixed-Precision Quantization [9.005098065862411]
A novel mixed-precision quantization method, termed CSMPQ, is proposed.
Specifically, the TF-IDF metric that is widely used in natural language processing (NLP) is introduced to measure the class separability of layer-wise feature maps.
Without any iterative process, the proposed CSMPQ achieves better compression trade-offs than the state-of-the-art quantization methods.
arXiv Detail & Related papers (2022-12-20T12:52:19Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly applies channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels.
Experiments on the ImageNet classification task prove that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv Detail & Related papers (2020-03-11T08:00:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.