DeepCuts: Single-Shot Interpretability based Pruning for BERT
- URL: http://arxiv.org/abs/2212.13392v1
- Date: Tue, 27 Dec 2022 07:21:41 GMT
- Title: DeepCuts: Single-Shot Interpretability based Pruning for BERT
- Authors: Jasdeep Singh Grover, Bhavesh Gawri, Ruskin Raj Manku
- Abstract summary: We show that our scoring functions are able to assign more relevant task-based scores to the network parameters.
We also analyze our pruning masks and find them to be significantly different from the ones obtained using standard metrics.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As language models have grown in parameters and layers, it has become much
harder to train and infer with them on single GPUs. This is severely
restricting the availability of large language models such as GPT-3,
BERT-Large, and many others. A common technique to solve this problem is
pruning the network architecture by removing transformer heads, fully-connected
weights, and other modules. The main challenge is to discern the important
parameters from the less important ones. Our goal is to find strong metrics for
identifying such parameters. We thus propose two strategies for calculating
importance scores: Cam-Cut, based on GradCAM interpretations, and Smooth-Cut,
based on SmoothGrad. Through this work, we show that our scoring
functions are able to assign more relevant task-based scores to the network
parameters, and thus both our pruning approaches significantly outperform the
standard weight and gradient-based strategies, especially at higher compression
ratios in BERT-based models. We also analyze our pruning masks and find them to
be significantly different from the ones obtained using standard metrics.
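The abstract does not spell out the exact scoring formulas, but the idea behind Smooth-Cut can be illustrated as follows: average weight gradients over several noise-perturbed copies of the input embeddings (SmoothGrad), score each weight by the magnitude of gradient times weight, and zero out the globally lowest-scoring fraction. The sketch below is an illustration under those assumptions with a HuggingFace-style BERT classifier; smoothgrad_importance, prune_by_score, and the hyperparameters are hypothetical names for this sketch, not the paper's implementation.

```python
# Minimal sketch of SmoothGrad-style importance scoring for unstructured
# pruning (Smooth-Cut-like). Assumes a HuggingFace BERT classifier whose
# forward pass accepts `inputs_embeds` and returns an object with `.logits`.
# The paper's exact Smooth-Cut formulation may differ from this illustration.
import torch
import torch.nn.functional as F

def smoothgrad_importance(model, embeds, labels, n_samples=8, sigma=0.1):
    """Average |grad * weight| over Gaussian-perturbed input embeddings."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for _ in range(n_samples):
        noisy = embeds + sigma * torch.randn_like(embeds)  # SmoothGrad noise
        model.zero_grad()
        logits = model(inputs_embeds=noisy).logits
        F.cross_entropy(logits, labels).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.grad * p.detach()).abs() / n_samples
    return scores

def prune_by_score(model, scores, sparsity=0.5):
    """Zero out the globally lowest-scoring fraction of weights."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(sparsity * flat.numel()))
    threshold = flat.kthvalue(k).values                    # global score cutoff
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.mul_((scores[n] > threshold).float())        # apply pruning mask
```

Cam-Cut would analogously derive scores from GradCAM-style attributions rather than raw averaged gradients.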
Related papers
- MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning [7.262751938473306]
Pruning is a well-established technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation.
We develop a new pruning algorithm, MPruner, that leverages mutual information through vector similarity; a minimal CKA sketch follows this list.
MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
arXiv Detail & Related papers (2024-08-24T05:54:47Z)
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public.
ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z)
- Learning to Compose SuperWeights for Neural Parameter Allocation Search [61.078949532440724]
We show that our approach can generate parameters for many networks using the same set of weights.
This enables us to support tasks like efficient ensembling and anytime prediction.
arXiv Detail & Related papers (2023-12-03T04:20:02Z)
- Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models [30.246821533532017]
Large Language Models (LLMs) with billions of parameters are prime targets for network pruning, which removes some model weights without hurting performance.
We present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner).
arXiv Detail & Related papers (2023-11-08T18:59:54Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparsity-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
- Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z)
- BERMo: What can BERT learn from ELMo? [6.417011237981518]
We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths.
Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
arXiv Detail & Related papers (2021-10-18T17:35:41Z)
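As referenced from the MPruner entry above, the following is a minimal sketch of linear Centered Kernel Alignment (CKA), the layer-similarity measure named in that entry's title. It is only the standard linear CKA formula applied to two activation matrices, not a reconstruction of the MPruner procedure itself.

```python
# Minimal sketch of linear CKA between two layers' activations, the
# similarity measure named in the MPruner entry above (not MPruner itself).
# X and Y are (n_samples, n_features) activation matrices.
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices."""
    X = X - X.mean(dim=0, keepdim=True)          # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(Y.T @ X) ** 2      # ||Y^T X||_F^2
    return cross / (torch.linalg.norm(X.T @ X) * torch.linalg.norm(Y.T @ Y))
```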