QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models
- URL: http://arxiv.org/abs/2412.11629v1
- Date: Mon, 16 Dec 2024 10:14:01 GMT
- Title: QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models
- Authors: Changhai Zhou, Yuhua Zhou, Shijie Han, Qian Qiao, Hongguang Li
- Abstract summary: Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks.
Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation.
We introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference.
We propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme.
- Score: 3.093903491123962
- Abstract: The rise of large language models (LLMs) has significantly advanced various natural language processing (NLP) tasks. However, the resource demands of these models pose substantial challenges. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation, necessitating parameter updates to adapt. Unfortunately, such fine-tuning requires substantial memory, which limits its applicability. To address these challenges, we introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. However, the combined errors from pruning and quantization increase the difficulty of fine-tuning, requiring a more refined quantization scheme. To this end, we propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme. Quantization precisions are assigned to each layer based on their importance to the target task, and Bayesian optimization is employed to refine precision allocation strategies, ensuring a balance between model accuracy and memory efficiency. Extensive experiments on benchmark datasets demonstrate that QPruner significantly outperforms existing methods in memory savings while maintaining or improving model performance.
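As a rough illustration of the precision-allocation step described in the abstract, the sketch below searches over per-layer bit-widths with Bayesian optimization (via scikit-optimize's gp_minimize). This is not the authors' code: the layer importance scores, memory budget, and accuracy proxy are hypothetical stand-ins for the task-specific measurements QPruner would obtain from the pruned model.

```python
# Minimal sketch of layer-wise mixed-precision allocation via Bayesian optimization.
# Assumptions: hypothetical importance scores and a closed-form accuracy proxy stand
# in for evaluating the pruned-and-quantized model on the target task.
import numpy as np
from skopt import gp_minimize
from skopt.space import Categorical

NUM_LAYERS = 12
BIT_CHOICES = [2, 4, 8]                          # candidate precisions per layer
importance = np.linspace(1.0, 0.2, NUM_LAYERS)   # hypothetical per-layer task importance
memory_budget = NUM_LAYERS * 4                   # average of 4 bits per layer

def objective(bits):
    bits = np.asarray(bits, dtype=float)
    # Proxy for accuracy loss: important layers suffer more from low precision.
    accuracy_loss = float(np.sum(importance / bits))
    # Soft penalty when the allocation exceeds the memory budget.
    overshoot = max(0.0, float(bits.sum()) - memory_budget)
    return accuracy_loss + 0.5 * overshoot

space = [Categorical(BIT_CHOICES, name=f"layer_{i}") for i in range(NUM_LAYERS)]
result = gp_minimize(objective, space, n_calls=40, random_state=0)
print("bit-widths per layer:", result.x)
print("objective value:", result.fun)
```

In the actual framework, each candidate allocation would be scored by measured task accuracy of the pruned-and-quantized model rather than by a closed-form proxy.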
Related papers
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose sparse gradient compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Post-Training Non-Uniform Quantization for Convolutional Neural Networks [0.0]
Quantization is a technique that aims to alleviate large storage requirements and speed up the inference process.
In this paper, we introduce a novel post-training quantization method for model weights.
Our method finds optimal clipping thresholds and scaling factors, with mathematical guarantees that it minimizes quantization noise; a toy illustration of clipping-and-scaling weight quantization appears after the Related papers list.
arXiv Detail & Related papers (2024-12-10T10:33:58Z) - Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy.
Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z) - Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z) - LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid [36.33062038680275]
Large language models (LLMs) have shown immense potential across various domains.
Post-training quantization has emerged as a promising technique to reduce memory requirements and decoding latency.
We propose LeanQuant, a novel quantization method that is accurate, versatile, and scalable.
arXiv Detail & Related papers (2024-07-14T00:23:51Z) - AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks.
Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph.
We propose the Adaptive Zeroth-Order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of ZO methods; a minimal sketch of the forward-pass-only gradient estimate that ZO methods rely on appears after the Related papers list.
arXiv Detail & Related papers (2024-06-26T04:33:13Z) - Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post-training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs.
Our experiments demonstrate latency reductions of 25.48%, 21.69%, and 33.28% relative to a 16-bit baseline.
arXiv Detail & Related papers (2023-06-08T02:18:58Z) - AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing the latency, power consumption, and model size without significant harm to the performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z) - Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR with low-bit quantization achieves performance on par with its full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
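The clipping-and-scaling idea in the post-training non-uniform quantization entry above can be made concrete with a toy search: quantize a weight vector symmetrically at several candidate clipping thresholds and keep the one with the lowest quantization noise. The weight distribution and bit-width below are hypothetical, and the cited paper derives its thresholds analytically with guarantees rather than by brute-force search.

```python
# Toy post-training weight quantization: grid-search the clipping threshold that
# minimizes quantization noise (MSE). Weights and bit-width are made-up placeholders.
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(scale=0.05, size=4096)   # stand-in for one layer's weights
bits = 4
qmax = 2 ** (bits - 1) - 1                    # symmetric signed range: [-qmax, qmax]

def quantize(w, clip):
    scale = clip / qmax                        # scaling factor implied by the threshold
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

candidates = np.linspace(0.01, np.abs(weights).max(), 64)
errors = [np.mean((weights - quantize(weights, c)) ** 2) for c in candidates]
best = candidates[int(np.argmin(errors))]
print(f"best clipping threshold: {best:.4f}, MSE: {min(errors):.2e}")
```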
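Similarly, the forward-pass-only fine-tuning mentioned in the AdaZeta entry rests on zeroth-order gradient estimation. Below is a minimal, self-contained sketch of a two-point (MeZO-style) estimate on a toy quadratic loss; the loss function, step size, and perturbation scale are illustrative assumptions, not values from either paper.

```python
# Minimal sketch of a zeroth-order (two-point) update: only forward evaluations of the
# loss are used, so no backpropagation graph is ever built. Toy quadratic loss stands
# in for the LLM fine-tuning loss.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=8)            # stand-in for trainable parameters
target = np.ones(8)

def loss(params):
    # Forward pass only: in MeZO this would be the model's task loss.
    return float(np.sum((params - target) ** 2))

eps, lr = 1e-3, 1e-2
for step in range(200):
    z = rng.normal(size=theta.shape)                                  # perturbation direction
    g = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)   # scalar directional estimate
    theta -= lr * g * z                                               # move along the perturbation
print("final loss:", loss(theta))
```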
This list is automatically generated from the titles and abstracts of the papers on this site.