Progressive Binarization with Semi-Structured Pruning for LLMs
- URL: http://arxiv.org/abs/2502.01705v4
- Date: Sat, 27 Sep 2025 03:38:26 GMT
- Title: Progressive Binarization with Semi-Structured Pruning for LLMs
- Authors: Xianglong Yan, Tianao Zhang, Zhiteng Li, Haotong Qin, Yulun Zhang,
- Abstract summary: We propose Progressive Binarization with Semi-Structured Pruning (PBS$2$P), a novel post-training framework that seamlessly integrates binarization and semi-structured pruning.<n>We show that PBS$2$P consistently outperforms state-of-the-art (SOTA) binary post-training quantization methods in both perplexity and downstream accuracy.
- Score: 36.91249209658632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved remarkable progress in natural language processing, but their high computational and memory costs hinder deployment on resource-constrained devices. Binarization represents the most extreme form of quantization, yet binarized models still contain redundancy that can be further removed. Pruning provides a natural way to eliminate such redundancy, but na\"ive combination with binarization often results in severe performance degradation. In this paper, we propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training framework that seamlessly integrates binarization and semi-structured pruning. We first propose Stepwise semi-structured Pruning with Binarization Optimization (SPBO), which progressively introduces sparsity while optimizing binarization parameters to jointly reduce pruning and quantization error, yielding more stable and accurate compression. Additionally, we propose a Coarse-to-Fine Search (CFS) that first allocates pruning ratios and then refines element selection, further enhancing overall performance. Extensive experiments across multiple LLM families show that PBS$^2$P consistently outperforms state-of-the-art (SOTA) binary post-training quantization methods in both perplexity and downstream accuracy. The code and models will be available at https://github.com/XIANGLONGYAN/PBS2P.
Related papers
- RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs [5.782015253162346]
Residual binarization enables matmul-free inference by stacking binary layers.<n>We propose RaBiT, a novel quantization framework that resolves coadaptation by algorithmically enforcing a residual hierarchy.<n>RaBiT achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49times$ inference speed-up over full-precision models.
arXiv Detail & Related papers (2026-02-05T06:41:11Z) - RS-ORT: A Reduced-Space Branch-and-Bound Algorithm for Optimal Regression Trees [2.612627266839037]
Mixed-integer programming (MIP) has emerged as a powerful framework for learning optimal decision trees.<n>Naively binarizing continuous features sacrifices global optimality and often yields needlessly deep trees.<n>We recast the optimal regression-tree training as a two-stage optimization problem and propose Reduced-Space Optimal Regression Trees (RS-ORT)<n> RS-ORT is a specialized branch-and-bound (BB) algorithm that branches exclusively on tree-structural variables.
arXiv Detail & Related papers (2025-10-27T22:17:09Z) - Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning [71.30276778807068]
We propose a unified framework that strategically coordinates sample pruning and token pruning.<n>Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data.
arXiv Detail & Related papers (2025-09-28T13:27:38Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment.<n>We propose PT$2$-LLM, a post-training ternarization framework tailored for LLMs.<n>At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved textbfMixed textbfPrecision textbfQuantization framework for extremely low-bit textbfDiffusion textbfModels.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - Highly Efficient and Effective LLMs with Multi-Boolean Architectures [1.4195677954898822]
Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs)<n>We introduce a novel framework that effectively transforms LLMs into multi- kernel Boolean parameters, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights.<n>Our method outperforms recent ultra low-bit quantization and binarization methods.
arXiv Detail & Related papers (2025-05-28T19:40:34Z) - SPAP: Structured Pruning via Alternating Optimization and Penalty Methods [2.1388885579612804]
Large language models (LLMs) are often constrained by their substantial computational and memory demands.<n>We propose SPAP (Structured Pruning via Alternating Optimization and Penalty Methods), a novel and efficient structured pruning framework for LLMs grounded in optimization theory.<n>Our work offers a practical, optimization-driven solution for pruning LLMs while preserving model performance.
arXiv Detail & Related papers (2025-05-06T09:47:53Z) - Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training [44.48966200270378]
Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO)imats presents significant computational challenges.
We propose a bilevel optimization framework that complements ZO methods with PEFT to mitigate sensitivity to hard prompts.
Our Bilevel ZOFO method employs a double-loop optimization strategy, where only the gradient of the PEFT model and the forward pass of the base model are required.
arXiv Detail & Related papers (2025-02-05T20:47:44Z) - Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models [1.6385815610837167]
Pivoting Factorization (PIFA) is a novel low-rank representation that unsupervisedly learns a compact form of any low-rank representation.<n>PIFA achieves 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension.<n>MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods.
arXiv Detail & Related papers (2025-01-31T12:36:31Z) - Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one-shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z) - ARB-LLM: Alternating Refined Binarizations for Large Language Models [82.24826360906341]
ARB-LLM is a novel 1-bit post-training quantization (PTQ) technique tailored for Large Language Models (LLMs)
As a binary PTQ method, our ARB-LLM$_textRC$ is the first to surpass FP16 models of the same size.
arXiv Detail & Related papers (2024-10-04T03:50:10Z) - A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models [24.185245582500876]
We introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms.
FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning.
We evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity.
arXiv Detail & Related papers (2024-08-07T12:33:46Z) - Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining [16.026565606764954]
We simplify the pruning process for Transformer-based large language models (LLMs)
We propose two inference-aware pruning criteria derived from the optimization perspective of output approximation.
We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining.
arXiv Detail & Related papers (2024-07-26T23:53:59Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
Sparsity-Preserved efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show how distillation, which has been super effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
arXiv Detail & Related papers (2023-05-24T08:18:35Z) - Advancing Model Pruning via Bi-level Optimization [89.88761425199598]
iterative magnitude pruning (IMP) is the predominant pruning method to successfully find 'winning tickets'
One-shot pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP.
We show that the proposed bi-level optimization-oriented pruning method (termed BiP) is a special class of BLO problems with a bi-linear problem structure.
arXiv Detail & Related papers (2022-10-08T19:19:29Z) - Efficient Micro-Structured Weight Unification and Pruning for Neural
Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially for resource limited devices.
Previous unstructured or structured weight pruning methods can hardly truly accelerate inference.
We propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.