Progressive Binarization with Semi-Structured Pruning for LLMs
- URL: http://arxiv.org/abs/2502.01705v2
- Date: Sat, 08 Feb 2025 02:23:05 GMT
- Title: Progressive Binarization with Semi-Structured Pruning for LLMs
- Authors: Xianglong Yan, Tianao Zhang, Zhiteng Li, Yulun Zhang,
- Abstract summary: Large language models (LLMs) have achieved remarkable success in natural language processing tasks.
Their high computational and memory demands pose challenges for deployment on resource-constrained devices.
We propose a Progressive Binarization with Semi-Structured Pruning (PBS$2$P) method for LLM compression.
- Score: 36.32239429974179
- License:
- Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing tasks, but their high computational and memory demands pose challenges for deployment on resource-constrained devices. Binarization, as an efficient compression method that reduces model weights to just 1 bit, significantly lowers both computational and memory requirements. Despite this, the binarized LLM still contains redundancy, which can be further compressed. Semi-structured pruning provides a promising approach to achieve this, which offers a better trade-off between model performance and hardware efficiency. However, simply combining binarization with semi-structured pruning can lead to a significant performance drop. To address this issue, we propose a Progressive Binarization with Semi-Structured Pruning (PBS$^2$P) method for LLM compression. We first propose a Stepwise semi-structured Pruning with Binarization Optimization (SPBO). Our optimization strategy significantly reduces the total error caused by pruning and binarization, even below that of the no-pruning scenario. Furthermore, we design a Coarse-to-Fine Search (CFS) method to select pruning elements more effectively. Extensive experiments demonstrate that PBS$^2$P achieves superior accuracy across various LLM families and evaluation metrics, noticeably outperforming state-of-the-art (SOTA) binary PTQ methods. The code and models will be available at https://github.com/XIANGLONGYAN/PBS2P.
Related papers
- Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one-shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z) - SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching [32.4599581528901]
"Two-tower" architecture is used for compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter.
We propose SpaLLM (Sketched Adapting of LLMs), a novel compressive adaptation approach for LLMs.
We show that SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables.
arXiv Detail & Related papers (2024-10-08T20:58:24Z) - ARB-LLM: Alternating Refined Binarizations for Large Language Models [82.24826360906341]
ARB-LLM is a novel 1-bit post-training quantization (PTQ) technique tailored for Large Language Models (LLMs)
As a binary PTQ method, our ARB-LLM$_textRC$ is the first to surpass FP16 models of the same size.
arXiv Detail & Related papers (2024-10-04T03:50:10Z) - A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models [24.185245582500876]
We introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms.
FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning.
We evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity.
arXiv Detail & Related papers (2024-08-07T12:33:46Z) - Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining [16.026565606764954]
We simplify the pruning process for Transformer-based large language models (LLMs)
We propose two inference-aware pruning criteria derived from the optimization perspective of output approximation.
We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining.
arXiv Detail & Related papers (2024-07-26T23:53:59Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
Sparsity-Preserved efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show how distillation, which has been super effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
arXiv Detail & Related papers (2023-05-24T08:18:35Z) - Advancing Model Pruning via Bi-level Optimization [89.88761425199598]
iterative magnitude pruning (IMP) is the predominant pruning method to successfully find 'winning tickets'
One-shot pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP.
We show that the proposed bi-level optimization-oriented pruning method (termed BiP) is a special class of BLO problems with a bi-linear problem structure.
arXiv Detail & Related papers (2022-10-08T19:19:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.