PGB: One-Shot Pruning for BERT via Weight Grouping and Permutation
- URL: http://arxiv.org/abs/2502.03984v1
- Date: Thu, 06 Feb 2025 11:34:41 GMT
- Title: PGB: One-Shot Pruning for BERT via Weight Grouping and Permutation
- Authors: Hyemin Lim, Jaeyeon Lee, Dong-Wan Choi
- Abstract summary: This paper proposes a novel semi-structured one-shot pruning method for BERT, called $\textit{Permutation and Grouping for BERT}$ (PGB).
PGB identifies important groups of individual weights by permutation and prunes all other weights as a structure in both multi-head attention and feed-forward layers.
Our experimental results on BERT$_{\text{BASE}}$ demonstrate that PGB outperforms state-of-the-art structured pruning methods in terms of computational cost and accuracy preservation.
- Abstract: Large pretrained language models such as BERT suffer from slow inference and high memory usage, due to their huge size. Recent approaches to compressing BERT rely on iterative pruning and knowledge distillation, which, however, are often too complicated and computationally intensive. This paper proposes a novel semi-structured one-shot pruning method for BERT, called $\textit{Permutation and Grouping for BERT}$ (PGB), which achieves high compression efficiency and sparsity while preserving accuracy. To this end, PGB identifies important groups of individual weights by permutation and prunes all other weights as a structure in both multi-head attention and feed-forward layers. Furthermore, if no important group is formed in a particular layer, PGB drops the entire layer to produce an even more compact model. Our experimental results on BERT$_{\text{BASE}}$ demonstrate that PGB outperforms the state-of-the-art structured pruning methods in terms of computational cost and accuracy preservation.
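As a rough illustration of the permutation-and-grouping idea (our own simplification, not the paper's exact algorithm): columns of a weight matrix are permuted so that high-importance weights cluster into contiguous groups, whole low-importance groups are pruned as a block, and the permutation is undone so the layer's interface is unchanged. The L2-norm importance score and the keep ratio below are assumptions for the sketch.

```python
import numpy as np

def permute_group_prune(W, group_size=4, keep_ratio=0.5):
    """Semi-structured one-shot pruning sketch in the spirit of PGB.
    Hypothetical simplification: column importance is the L2 norm;
    the paper's actual grouping/permutation criterion may differ."""
    d_out, d_in = W.shape
    assert d_in % group_size == 0, "columns must divide evenly into groups"
    scores = np.linalg.norm(W, axis=0)        # importance per column
    perm = np.argsort(-scores)                # sort columns, best first
    n_groups = d_in // group_size
    n_keep = max(1, int(round(n_groups * keep_ratio)))
    keep_mask = np.zeros(d_in, dtype=bool)
    keep_mask[: n_keep * group_size] = True   # keep the top groups as one block
    W_pruned = W[:, perm] * keep_mask[None, :]
    inv = np.argsort(perm)                    # undo the permutation
    return W_pruned[:, inv]

# Example: 8x8 weights, groups of 4, keep half -> 4 columns survive intact
W = np.arange(64, dtype=float).reshape(8, 8) + 1.0
W_p = permute_group_prune(W, group_size=4, keep_ratio=0.5)
```

Because the permutation is inverted at the end, the pruned matrix can drop into the original layer unchanged; only the sparsity pattern differs.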
Related papers
- Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum input length of 512 tokens, which makes them non-trivial to apply in practical settings with long inputs.
We propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows fine-tuning of any pretrained BERT model to perform inference on arbitrarily long text.
arXiv Detail & Related papers (2023-10-31T15:41:08Z)
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparsity-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate that PST performs on par with or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
- Exploring Extreme Parameter Compression for Pre-trained Language Models [45.80044281531393]
This work explores larger compression ratios for pre-trained language models (PLMs).
Two decomposition and reconstruction protocols are proposed to improve the effectiveness and efficiency during compression.
A tiny version achieves $96.7\%$ of the performance of BERT-base with $1/48$ of the encoder parameters and $2.7\times$ faster inference.
arXiv Detail & Related papers (2022-05-20T09:16:55Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the $\textit{homogeneous}$ word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
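The classic second-order saliency such methods approximate is the Optimal Brain Surgeon score $\rho_i = w_i^2 / (2 [H^{-1}]_{ii})$: the expected loss increase from removing weight $i$. Below is a diagonal-approximation sketch of that scoring rule (our illustration with made-up numbers; O-BERT-S itself uses an efficient blocked inverse-Hessian approximation).

```python
import numpy as np

def obs_saliency(w, h_inv_diag):
    """OBS saliency rho_i = w_i^2 / (2 * [H^-1]_ii): estimated loss
    increase from pruning weight i. Diagonal approximation for
    illustration; not the paper's blocked algorithm."""
    return w ** 2 / (2.0 * h_inv_diag)

# A large weight in a flat loss direction can still be cheap to prune.
w = np.array([0.1, -2.0, 0.5])          # hypothetical weights
h_inv = np.array([1.0, 0.5, 4.0])       # hypothetical [H^-1] diagonal
rho = obs_saliency(w, h_inv)
prune_order = np.argsort(rho)           # smallest saliency pruned first
```

Note how curvature reweights pure magnitude pruning: the third weight (0.5) is ranked cheaper to remove than the second (-2.0) because its inverse-Hessian entry is large.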
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT designed to eliminate performance bottlenecks.
Our method yields impressive savings of $56.3\times$ in FLOPs and $31.2\times$ in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
The performance of BERT-based models is mainly driven by their enormous number of parameters, which hinders their application in resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than BERT$_{\text{BASE}}$.
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.