How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
- URL: http://arxiv.org/abs/2312.13547v1
- Date: Thu, 21 Dec 2023 03:11:30 GMT
- Title: How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
- Authors: Eldar Kurtic, Torsten Hoefler, Dan Alistarh
- Abstract summary: We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets.
We propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark.
- Score: 60.72725673114168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pruning large language models (LLMs) from the BERT family has emerged as a
standard compression benchmark, and several pruning methods have been proposed
for this task. The recent "Sparsity May Cry" (SMC) benchmark put into
question the validity of all existing methods, exhibiting a more complex setup
where many known pruning methods appear to fail. We revisit the question of
accurate BERT-pruning during fine-tuning on downstream datasets, and propose a
set of general guidelines for successful pruning, even on the challenging SMC
benchmark. First, we perform a cost-vs-benefits analysis of pruning model
components, such as the embeddings and the classification head; second, we
provide a simple-yet-general way of scaling training, sparsification and
learning rate schedules relative to the desired target sparsity; finally, we
investigate the importance of proper parametrization for Knowledge Distillation
in the context of LLMs. Our simple insights lead to state-of-the-art results,
both on classic BERT-pruning benchmarks, as well as on the SMC benchmark,
showing that even classic gradual magnitude pruning (GMP) can yield competitive
results, with the right approach.
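To make this recipe concrete, the sketch below shows the usual ingredients of such a pipeline in generic PyTorch: a cubic sparsity schedule scaled to the desired target sparsity, gradual magnitude pruning applied to a chosen subset of weight tensors (deciding which components, e.g. the embeddings or the classification head, are worth pruning is what the cost-vs-benefits analysis is about), and a standard knowledge-distillation objective. This is a minimal illustration under these assumptions, not the authors' released code; all hyperparameters and helper names are placeholders.

```python
# Minimal sketch of gradual magnitude pruning (GMP) with knowledge distillation.
# Hyperparameters, schedule lengths, and the choice of which tensors to pass into
# magnitude_prune are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F


def cubic_sparsity(step, start_step, end_step, target_sparsity, init_sparsity=0.0):
    """Cubic sparsity schedule (Zhu & Gupta, 2017), scaled to the target sparsity."""
    if step <= start_step:
        return init_sparsity
    if step >= end_step:
        return target_sparsity
    frac = (step - start_step) / (end_step - start_step)
    return target_sparsity + (init_sparsity - target_sparsity) * (1.0 - frac) ** 3


def magnitude_prune(weights, sparsity):
    """Globally zero out the smallest-magnitude weights across the given tensors;
    which tensors to include (e.g. embeddings, classifier) is a design choice."""
    if sparsity <= 0.0:
        return
    all_mags = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * all_mags.numel())
    if k == 0:
        return
    threshold = all_mags.kthvalue(k).values
    for w in weights:
        w.data.mul_((w.detach().abs() > threshold).float())


def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Knowledge distillation: soft teacher targets mixed with the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

In a full fine-tuning run, magnitude_prune would be applied periodically to the selected weight matrices, with the sparsification and learning-rate schedules stretched or compressed together as the target sparsity changes.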
Related papers
- Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications.
Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
- FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws.
We propose a fine-grained token-wise pruning approach for LLMs, which uses a learnable router to adaptively identify the less important tokens.
Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
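As a rough illustration of the router idea (not the paper's implementation), a small learnable scorer can rank token hidden states and keep only a fixed fraction of them; the shapes, keep ratio, and module below are assumptions.

```python
# Generic sketch of a learnable token router that drops low-importance tokens.
# Shapes and the keep ratio are illustrative assumptions.
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    def __init__(self, hidden_dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # learnable importance score per token
        self.keep_ratio = keep_ratio

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim)
        scores = self.scorer(hidden_states).squeeze(-1)           # (batch, seq_len)
        k = max(1, int(self.keep_ratio * hidden_states.size(1)))
        top = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original token order
        idx = top.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        return hidden_states.gather(1, idx), top                  # pruned tokens + kept indices
```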
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
- Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling [3.873482175367558]
In this paper, we treat the Generation of each token by a Large Language Model (LLM) as a Classification (GaC) for ensembling.
In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics and reasoning, and observe that our method breaks the existing community performance ceiling.
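Read literally, "generation as classification" means that at each decoding step every model contributes a probability distribution over the vocabulary and the ensemble combines them. The sketch below averages those distributions with greedy decoding, assuming Hugging Face-style causal LMs that share a tokenizer; this is a simplification for illustration, not the paper's exact procedure.

```python
# Sketch of token-level ensembling: average the next-token distributions of
# several causal LMs at each step. Assumes a shared tokenizer/vocabulary.
import torch


@torch.no_grad()
def ensemble_generate(models, tokenizer, prompt, max_new_tokens=32):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        probs = torch.stack(
            [m(ids).logits[:, -1, :].softmax(dim=-1) for m in models]
        ).mean(dim=0)                                  # average the "classification" heads
        next_id = probs.argmax(dim=-1, keepdim=True)   # greedy pick from the ensemble
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```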
arXiv Detail & Related papers (2024-06-18T13:17:26Z)
- MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
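The matching step can be pictured as nearest-neighbor search in a sentence-embedding space; the sketch below is a generic illustration with a hypothetical embed function, not MixEval's actual pipeline.

```python
# Generic sketch of matching web-mined queries to existing benchmark queries
# by cosine similarity. `embed` is a hypothetical sentence-embedding function
# returning unit-normalized vectors.
import numpy as np


def match_queries(web_queries, benchmark_queries, embed, top_k=1):
    w = embed(web_queries)          # (n_web, d)
    b = embed(benchmark_queries)    # (n_bench, d)
    sims = w @ b.T                  # cosine similarity for normalized rows
    nearest = np.argsort(-sims, axis=1)[:, :top_k]
    return [[benchmark_queries[j] for j in row] for row in nearest]
```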
arXiv Detail & Related papers (2024-06-03T05:47:05Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple Logits Retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations [4.666056064419346]
The efficient coding hypothesis proposes that the response properties of sensory systems are adapted to the statistics of their inputs.
While elegant, information theoretic properties are notoriously difficult to measure in practical settings or to employ as objective functions in optimization.
Here we outline the assumptions that allow manifold capacity to be optimized directly, yielding Maximum Manifold Capacity Representations (MMCR).
arXiv Detail & Related papers (2023-03-06T17:26:30Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP on downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- GMP*: Well-Tuned Global Magnitude Pruning Can Outperform Most BERT-Pruning Methods [27.761221746022365]
We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models.
We show that a simple and general variant, which we call GMP*, can match and sometimes outperform more complex state-of-the-art methods.
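For reference, GMP-style methods commonly anneal sparsity with the cubic schedule of Zhu & Gupta (2017): with initial sparsity $s_i$, target sparsity $s_f$, and pruning window $[t_0, t_f]$, the sparsity at step $t$ is

$$ s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{t_f - t_0}\right)^{3}, \qquad t_0 \le t \le t_f. $$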
arXiv Detail & Related papers (2022-10-12T16:35:47Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
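As a generic picture of this family of methods (not the paper's exact formulation), a small parametric adversary can produce self-normalized example weights by which the model's loss is reweighted; the architectures, inputs, and KL penalty strength below are illustrative assumptions.

```python
# Generic sketch of DRO with a parametric reweighting adversary: the adversary
# up-weights hard examples, the model minimizes the reweighted loss.
import math
import torch
import torch.nn.functional as F


def dro_step(model, adversary, x, y, opt_model, opt_adv, kl_coef=1.0):
    per_example = F.cross_entropy(model(x), y, reduction="none")   # (batch,)
    log_w = adversary(x).squeeze(-1)                               # unnormalized log-ratios
    w = F.softmax(log_w, dim=0) * x.size(0)                        # self-normalized weights

    # Model step: minimize the adversarially reweighted loss.
    opt_model.zero_grad()
    (w.detach() * per_example).mean().backward()
    opt_model.step()

    # Adversary step: maximize the reweighted loss, with a KL penalty that keeps
    # the weights (i.e. the likelihood ratio) close to uniform.
    opt_adv.zero_grad()
    p = F.softmax(log_w, dim=0)
    kl_to_uniform = (p * (torch.log(p + 1e-12) + math.log(x.size(0)))).sum()
    (-(w * per_example.detach()).mean() + kl_coef * kl_to_uniform).backward()
    opt_adv.step()
```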
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
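One generic way to realize full-text scoring is to let a language model score each candidate completion in context and rank candidates by likelihood. The sketch below uses a causal LM ("gpt2") purely for illustration; it is a likelihood-ranking stand-in, not the paper's exact scoring method.

```python
# Sketch of ranking answer candidates by full-text language-model likelihood.
# Model name and the per-token scoring rule are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def rank_candidates(premise, candidates):
    scores = []
    for cand in candidates:
        ids = tokenizer(premise + " " + cand, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss   # mean token cross-entropy over the full text
        scores.append(-loss.item())          # higher = more plausible
    return max(range(len(candidates)), key=lambda i: scores[i])
```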
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.