How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
- URL: http://arxiv.org/abs/2312.13547v1
- Date: Thu, 21 Dec 2023 03:11:30 GMT
- Title: How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
- Authors: Eldar Kurtic, Torsten Hoefler, Dan Alistarh
- Abstract summary: We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets.
We propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark.
- Score: 60.72725673114168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pruning large language models (LLMs) from the BERT family has emerged as a
standard compression benchmark, and several pruning methods have been proposed
for this task. The recent "Sparsity May Cry" (SMC) benchmark put into
question the validity of all existing methods, exhibiting a more complex setup
where many known pruning methods appear to fail. We revisit the question of
accurate BERT-pruning during fine-tuning on downstream datasets, and propose a
set of general guidelines for successful pruning, even on the challenging SMC
benchmark. First, we perform a cost-vs-benefits analysis of pruning model
components, such as the embeddings and the classification head; second, we
provide a simple-yet-general way of scaling training, sparsification and
learning rate schedules relative to the desired target sparsity; finally, we
investigate the importance of proper parametrization for Knowledge Distillation
in the context of LLMs. Our simple insights lead to state-of-the-art results,
both on classic BERT-pruning benchmarks, as well as on the SMC benchmark,
showing that even classic gradual magnitude pruning (GMP) can yield competitive
results, with the right approach.
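The abstract names the key ingredients (a sparsification schedule scaled to the target sparsity and a properly parametrized Knowledge Distillation loss) without giving formulas. The sketch below is a minimal illustration of how these pieces typically fit together, assuming PyTorch, the widely used cubic sparsity ramp of Zhu & Gupta (2017), and the standard temperature-based KD parametrization; the function names and hyperparameter values are illustrative and are not taken from the paper.

```python
# Hypothetical sketch: gradual magnitude pruning (GMP) with a cubic sparsity
# schedule plus a temperature-parametrized knowledge-distillation (KD) loss.
# The schedule and KD form below are common choices, not the authors' exact setup.
import torch
import torch.nn.functional as F


def sparsity_at_step(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Cubic ramp from initial_sparsity to final_sparsity over total_steps."""
    progress = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def apply_magnitude_pruning(weight, sparsity):
    """Zero out the smallest-magnitude entries so that `sparsity` fraction is zero."""
    num_prune = int(sparsity * weight.numel())
    if num_prune == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    mask = (weight.abs() > threshold).float()
    weight.data.mul_(mask)   # prune in place
    return mask              # keep the mask to re-apply after optimizer updates


def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, hard_weight=0.5):
    """Distillation loss: softened KL term (scaled by T^2) plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return hard_weight * hard + (1.0 - hard_weight) * soft
```

In a fine-tuning loop, one would periodically recompute the target sparsity, call apply_magnitude_pruning on each prunable weight matrix, re-apply the stored masks after every optimizer step, and use kd_loss in place of (or mixed with) the plain task loss. The paper's guideline of scaling training, sparsification, and learning-rate schedules to the target sparsity would, in this sketch, correspond to lengthening total_steps and the learning-rate schedule as final_sparsity grows.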
Related papers
- Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models [0.29687381456164]
We propose a more flexible benchmarking approach for Large Language Models (LLMs).
Our method, Varco Arena, provides reference-free benchmarking of LLMs in tournament style.
Our empirical results, supported by simulation experiments, demonstrate that the Varco Arena tournament approach aligns better with the current Elo model for benchmarking LLMs.
arXiv Detail & Related papers (2024-11-02T15:23:28Z)
- Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling [3.873482175367558]
In this paper, we treat the Generation of each token by a Large Language Model (LLM) as a Classification (GaC) for ensembling.
In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics and reasoning, and observe that our method breaks the existing community performance ceiling.
arXiv Detail & Related papers (2024-06-18T13:17:26Z)
- MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) without the requirement of prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations [4.666056064419346]
The efficient coding hypothesis proposes that the response properties of sensory systems are adapted to the statistics of their inputs.
While elegant, information-theoretic properties are notoriously difficult to measure in practical settings or to employ as objective functions in optimization.
Here we outline the assumptions that allow manifold capacity to be optimized directly, yielding Maximum Manifold Capacity Representations (MMCR).
arXiv Detail & Related papers (2023-03-06T17:26:30Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP to downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- GMP*: Well-Tuned Global Magnitude Pruning Can Outperform Most BERT-Pruning Methods [27.761221746022365]
We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models.
We show that a simple and general variant, which we call GMP*, can match and sometimes outperform more complex state-of-the-art methods.
arXiv Detail & Related papers (2022-10-12T16:35:47Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Mutual-Information Based Few-Shot Classification [34.95314059362982]
We introduce Transductive Information Maximization (TIM) for few-shot learning.
Our method maximizes the mutual information between the query features and their label predictions for a given few-shot task.
We propose a new alternating-direction solver, which speeds up transductive inference over gradient-based optimization.
arXiv Detail & Related papers (2021-06-23T09:17:23Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.