Progressive Binarization with Semi-Structured Pruning for LLMs
- URL: http://arxiv.org/abs/2502.01705v3
- Date: Mon, 30 Jun 2025 05:16:02 GMT
- Title: Progressive Binarization with Semi-Structured Pruning for LLMs
- Authors: Xianglong Yan, Tianao Zhang, Zhiteng Li, Yulun Zhang
- Abstract summary: We propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training compression framework. We show that PBS$^2$P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy.
- Score: 36.32239429974179
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved remarkable progress in natural language processing, but their high computational and memory costs hinder deployment on resource-constrained devices. Binarization, which reduces model weights to 1 bit, is a promising solution for efficient inference. However, binarized LLMs still exhibit redundancy that can be further compressed. Semi-structured pruning offers a favorable trade-off between model performance and hardware efficiency, but naively combining it with binarization often leads to severe performance degradation. To address this, we propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training compression framework. We propose Stepwise semi-structured Pruning with Binarization Optimization (SPBO) to jointly reduce pruning and binarization error. Additionally, we develop a Coarse-to-Fine Search (CFS) strategy to more effectively select pruning elements. Extensive experiments across multiple LLM families show that PBS$^2$P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy. The code and models will be available at: https://github.com/XIANGLONGYAN/PBS2P.
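To make the combination concrete, below is a minimal numpy sketch of the two building blocks in isolation, assuming row-wise scales and magnitude-based 2:4 selection: 2:4 semi-structured pruning (keep the 2 largest-magnitude weights in each group of 4) followed by sign-plus-scale binarization. Applying the steps greedily like this is exactly the naive combination the abstract warns about; PBS$^2$P's SPBO instead reduces the pruning and binarization errors jointly, which this sketch does not attempt.

```python
# Minimal sketch: 2:4 semi-structured pruning, then 1-bit binarization.
# Each step is solved greedily in isolation -- an illustration of the
# building blocks, NOT the jointly optimized PBS^2P/SPBO procedure.
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude entries in every group of 4."""
    rows, cols = w.shape
    assert cols % 4 == 0, "2:4 sparsity needs column count divisible by 4"
    groups = w.reshape(rows, cols // 4, 4)
    keep = np.argsort(np.abs(groups), axis=-1)[..., 2:]  # 2 largest per group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=-1)
    return (groups * mask).reshape(rows, cols)

def binarize(w: np.ndarray) -> np.ndarray:
    """Row-wise 1-bit quantization: alpha * sign(w), alpha = mean |w| over kept weights."""
    nonzero = np.abs(w) > 0
    alpha = (np.abs(w) * nonzero).sum(1, keepdims=True) / np.maximum(nonzero.sum(1, keepdims=True), 1)
    return alpha * np.sign(w)  # sign(0) = 0, so the pruning mask survives

w = np.random.randn(8, 16)
w_compressed = binarize(prune_2_4(w))
print("reconstruction error:", np.linalg.norm(w - w_compressed))
```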
Related papers
- Highly Efficient and Effective LLMs with Multi-Boolean Architectures [1.4195677954898822]
Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). We introduce a novel framework that effectively transforms LLMs into multi-kernel Boolean parameters and, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights. Our method outperforms recent ultra-low-bit quantization and binarization methods.
arXiv Detail & Related papers (2025-05-28T19:40:34Z) - SPAP: Structured Pruning via Alternating Optimization and Penalty Methods [2.1388885579612804]
Large language models (LLMs) are often constrained by their substantial computational and memory demands. We propose SPAP (Structured Pruning via Alternating Optimization and Penalty Methods), a novel and efficient structured pruning framework for LLMs grounded in optimization theory. Our work offers a practical, optimization-driven solution for pruning LLMs while preserving model performance.
arXiv Detail & Related papers (2025-05-06T09:47:53Z) - Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training [44.48966200270378]
Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) optimizers presents significant computational challenges.
We propose a bilevel optimization framework that complements ZO methods with PEFT to mitigate sensitivity to hard prompts.
Our Bilevel ZOFO method employs a double-loop optimization strategy, where only the gradient of the PEFT model and the forward pass of the base model are required.
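For context, here is a minimal sketch of the two-point zeroth-order (ZO) gradient estimate that such methods build on: the gradient is approximated from two forward passes along a random direction, so no backward pass or activation storage is needed. The bilevel coupling with PEFT that defines Bilevel ZOFO is not shown; `zo_gradient` and the toy quadratic loss are illustrative names.

```python
# Minimal sketch of a two-point zeroth-order gradient estimate (SPSA-style).
import numpy as np

def zo_gradient(loss_fn, theta: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    z = np.random.randn(*theta.shape)  # random perturbation direction
    # Two forward passes approximate the directional derivative.
    g = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return g * z

# Toy usage: minimize a quadratic with ZO-SGD.
target = np.ones(10)
loss = lambda t: float(np.sum((t - target) ** 2))
theta = np.zeros(10)
for _ in range(2000):
    theta -= 0.01 * zo_gradient(loss, theta)
print("final loss:", loss(theta))
```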
arXiv Detail & Related papers (2025-02-05T20:47:44Z) - Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models [1.6385815610837167]
Pivoting Factorization (PIFA) is a novel low-rank representation that learns, without supervision, a compact form of any low-rank representation. PIFA achieves 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension. MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods.
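As a reference point, here is the plain truncated-SVD low-rank baseline that PIFA improves on. Note that at rank r = d/2, the two factors together store as many parameters as the original d x d matrix, which is why additional savings over plain low-rank layers matter; PIFA's pivoting-based compact representation itself is not reproduced here.

```python
# Minimal sketch of a plain low-rank (truncated SVD) layer -- the baseline,
# not PIFA itself.
import numpy as np

def low_rank_factor(w: np.ndarray, rank: int):
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # d x r
    b = vt[:rank]               # r x d
    return a, b                 # forward pass: x @ a @ b instead of x @ w

w = np.random.randn(256, 256)
a, b = low_rank_factor(w, rank=128)  # rank = 50% of dimension
# a and b hold 256*128 + 128*256 = 256^2 parameters: no savings at this rank.
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```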
arXiv Detail & Related papers (2025-01-31T12:36:31Z) - Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one-shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z) - ARB-LLM: Alternating Refined Binarizations for Large Language Models [82.24826360906341]
ARB-LLM is a novel 1-bit post-training quantization (PTQ) technique tailored for Large Language Models (LLMs).
As a binary PTQ method, our ARB-LLM$_\text{RC}$ is the first to surpass FP16 models of the same size.
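A minimal sketch of the alternating refinement idea behind such binarization schemes, assuming a row-wise mean mu and scale alpha so that W is approximated as mu + alpha * B: each quantity is updated in closed form with the others held fixed. ARB-LLM's column-group/row-group (RC) refinements and calibration data are omitted.

```python
# Minimal sketch of alternating binarization refinement: W ~ mu + alpha * B.
import numpy as np

def alternating_binarize(w: np.ndarray, iters: int = 5):
    mu = w.mean(axis=1, keepdims=True)
    b = np.sign(w - mu); b[b == 0] = 1.0
    alpha = np.abs(w - mu).mean(axis=1, keepdims=True)
    for _ in range(iters):
        mu = (w - alpha * b).mean(axis=1, keepdims=True)  # refine row mean
        # Closed-form scale minimizing ||w - mu - alpha * b||^2 per row.
        alpha = ((w - mu) * b).sum(1, keepdims=True) / (b * b).sum(1, keepdims=True)
        b = np.sign(w - mu); b[b == 0] = 1.0              # refine signs
    return mu, alpha, b

w = np.random.randn(4, 64)
mu, alpha, b = alternating_binarize(w)
print("error:", np.linalg.norm(w - (mu + alpha * b)))
```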
arXiv Detail & Related papers (2024-10-04T03:50:10Z) - A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models [24.185245582500876]
We introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms.
FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning.
We evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity.
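For illustration, here is a minimal sketch of FISTA (accelerated proximal gradient) applied to an l1-penalized layer-wise reconstruction, the kind of convex program such a pruner can be built on. FISTAPruner's cumulative error-correction mechanism and 2:4 mask handling are not reproduced, and the calibration data, `lam`, and iteration count are illustrative.

```python
# Minimal FISTA sketch for layer-wise pruning via l1-penalized reconstruction:
#   min_w  0.5 * ||X w - X w0||^2 + lam * ||w||_1
import numpy as np

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fista_prune(x, w0, lam=50.0, iters=200):
    L = np.linalg.norm(x, 2) ** 2      # Lipschitz constant: sigma_max(X)^2
    w, y, t = w0.copy(), w0.copy(), 1.0
    target = x @ w0
    for _ in range(iters):
        grad = x.T @ (x @ y - target)  # gradient of the reconstruction loss
        w_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = w_next + (t - 1) / t_next * (w_next - w)  # Nesterov momentum step
        w, t = w_next, t_next
    return w

x = np.random.randn(128, 64)           # synthetic calibration activations
w0 = np.random.randn(64)               # dense weights (one output column)
w_sparse = fista_prune(x, w0)
print("sparsity:", np.mean(w_sparse == 0))
```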
arXiv Detail & Related papers (2024-08-07T12:33:46Z) - Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining [16.026565606764954]
We simplify the pruning process for Transformer-based large language models (LLMs).
We propose two inference-aware pruning criteria derived from the optimization perspective of output approximation.
We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining.
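A minimal sketch of an inference-aware (output-approximation) pruning criterion in this spirit, scoring each weight by |w| times the input-channel activation norm, as popularized by Wanda; the paper's own criteria and its two-step reconstruction are not reproduced here.

```python
# Minimal sketch of an output-aware pruning criterion (Wanda-style score).
import numpy as np

def output_aware_scores(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """w: (out, in) weights; x: (tokens, in) calibration activations."""
    act_norm = np.linalg.norm(x, axis=0)  # per-input-channel activation norm
    return np.abs(w) * act_norm           # proxy for output perturbation

def prune_by_score(w, scores, sparsity=0.5):
    k = int(w.size * sparsity)
    threshold = np.partition(scores.ravel(), k)[k]
    return np.where(scores >= threshold, w, 0.0)

w = np.random.randn(32, 64)
x = np.random.randn(100, 64)
w_pruned = prune_by_score(w, output_aware_scores(w, x))
print("sparsity:", np.mean(w_pruned == 0))
```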
arXiv Detail & Related papers (2024-07-26T23:53:59Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
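A minimal sketch of back-propagation-free mask learning with a policy gradient, assuming Bernoulli keep-probabilities updated by REINFORCE from forward-pass losses only; the toy loss, learning rate, and sparsity penalty are illustrative stand-ins for the pruned model's loss and the paper's probabilistic-space formulation.

```python
# Minimal REINFORCE sketch: learn pruning-mask probabilities without backprop
# through the model -- only forward evaluations of the masked loss are used.
import numpy as np

np.random.seed(0)
w = np.random.randn(64)
loss = lambda wv: float(np.mean((wv - 1.0) ** 2))  # toy stand-in for model loss
p = np.full(64, 0.5)                               # per-weight keep probabilities
lr, baseline = 0.05, loss(w)
for _ in range(500):
    m = (np.random.rand(64) < p).astype(float)     # sample a pruning mask
    r = baseline - loss(w * m) - 0.5 * m.mean()    # reward: loss drop + sparsity bonus
    # Score function of the Bernoulli sample: d/dp log p(m) = (m - p) / (p(1-p)).
    grad_logp = (m - p) / np.clip(p * (1 - p), 1e-6, None)
    p = np.clip(p + lr * r * grad_logp, 0.01, 0.99)  # gradient ascent on reward
print("kept fraction:", float((p > 0.5).mean()))
```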
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
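A minimal sketch of residual binarization, the kind of mechanism that pushes the average bit-width slightly above 1 bit (e.g., 1.08): a small salient fraction of the weights receives a second binary pass on the residual. BiLLM's Hessian-based saliency search and bell-shaped weight splitting are omitted; the residual-energy heuristic below is an assumption.

```python
# Minimal sketch of residual binarization: one binary pass for all weights,
# plus a second pass on the residual of a small salient subset of columns.
import numpy as np

def binarize(w):
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    return alpha * np.sign(w)

def residual_binarize(w, salient_frac=0.08):
    approx = binarize(w)
    residual = w - approx
    # Heuristic: treat the columns with the largest residual energy as salient.
    energy = (residual ** 2).sum(axis=0)
    k = max(1, int(w.shape[1] * salient_frac))
    salient = np.argsort(energy)[-k:]
    approx[:, salient] += binarize(residual[:, salient])  # second binary pass
    return approx

w = np.random.randn(64, 256)
print("relative error:", np.linalg.norm(w - residual_binarize(w)) / np.linalg.norm(w))
```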
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show how distillation, which has been highly effective in task-agnostic compression of smaller BERT-style models, becomes ineffective compared with our simple pruning technique.
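A minimal sketch of the chop-then-continue-pretraining recipe, assuming a Hugging Face-style `nn.ModuleList` of transformer blocks; the attribute path `model.transformer.h`, the `keep=24` choice, and keeping the first layers (rather than some other contiguous block) are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of "just chop": drop transformer blocks, then continue
# language-model pretraining on the smaller model.
import torch.nn as nn

def chop_layers(layers: nn.ModuleList, keep: int) -> nn.ModuleList:
    """Keep the first `keep` transformer blocks, drop the rest."""
    return nn.ModuleList(list(layers)[:keep])

# Hypothetical usage with a loaded decoder-only model:
# model.transformer.h = chop_layers(model.transformer.h, keep=24)
# ...then run extended pretraining on the chopped model, which the abstract
# reports recovers quality better than distillation at the 7B scale.
```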
arXiv Detail & Related papers (2023-05-24T08:18:35Z) - Advancing Model Pruning via Bi-level Optimization [89.88761425199598]
Iterative magnitude pruning (IMP) is the predominant pruning method for successfully finding 'winning tickets'.
One-shot pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP.
We show that the proposed bi-level optimization-oriented pruning method (termed BiP) corresponds to a special class of BLO problems with a bi-linear problem structure.
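For reference, a minimal sketch of the IMP baseline: alternate a training phase with pruning a fixed fraction of the smallest-magnitude surviving weights until a target sparsity is reached. Weight rewinding and BiP's bi-level formulation are omitted, and `train_step` is a toy stand-in for actual training.

```python
# Minimal sketch of iterative magnitude pruning (IMP).
import numpy as np

def imp(w, train_step, target_sparsity=0.9, prune_frac=0.2, rounds=10):
    mask = np.ones_like(w)
    for _ in range(rounds):
        w = train_step(w * mask) * mask            # (re)train surviving weights
        if 1 - mask.mean() >= target_sparsity:
            break
        alive = np.abs(w[mask == 1])
        cutoff = np.quantile(alive, prune_frac)    # smallest fraction of survivors
        mask[(np.abs(w) <= cutoff) & (mask == 1)] = 0.0
    return w * mask, mask

# Toy usage: "training" nudges weights toward a target vector.
target = np.random.randn(100)
step = lambda wv: wv + 0.5 * (target - wv)
w_final, mask = imp(np.random.randn(100), step)
print("sparsity:", 1 - mask.mean())
```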
arXiv Detail & Related papers (2022-10-08T19:19:29Z) - Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially for resource limited devices.
Previous unstructured or structured weight pruning methods can hardly truly accelerate inference.
We propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)