Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs
"Difficult" Downstream Tasks in LLMs
- URL: http://arxiv.org/abs/2310.02277v2
- Date: Fri, 16 Feb 2024 21:10:12 GMT
- Title: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs
"Difficult" Downstream Tasks in LLMs
- Authors: Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, Zhangyang Wang
- Abstract summary: It has been believed that weights in large language models (LLMs) contain significant redundancy.
This paper presents a counter-argument: the small-magnitude weights of pre-trained models encode vital knowledge essential for tackling difficult downstream tasks.
- Score: 71.56345106591789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Junk DNA Hypothesis by adopting a novel task-centric angle for the
pre-trained weights of large language models (LLMs). It has been believed that
weights in LLMs contain significant redundancy, leading to the conception that
a considerable chunk of the parameters can be removed by pruning without
compromising performance. Contrary to this belief, this paper presents a
counter-argument: the small-magnitude weights of pre-trained models encode
vital knowledge essential for tackling difficult downstream tasks. This
manifests as a monotonic relationship: as we prune more pre-trained weights
by magnitude, the performance drop on downstream tasks grows with task
difficulty. Moreover, we reveal that removing these seemingly inconsequential
weights can result in an irreparable loss of knowledge and in performance
degradation on difficult tasks, even when downstream continual training is
allowed. Interestingly, our evaluations show that the other popular
compression technique, quantization, fails to exhibit a similar monotonic
effect and does not disentangle this task-difficulty information as
convincingly. To study this formally,
we introduce several quantifiable metrics to gauge the downstream task
difficulty: (1) within the same task category, and (2) across different task
categories. Our extensive experiments substantiate the Junk DNA Hypothesis
across a diverse range of model sizes, tasks, datasets, and even pruning
methods. Code is available at:
https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.
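To make the pruning operation studied in the abstract concrete, here is a minimal sketch of layer-wise unstructured magnitude pruning for a Hugging Face causal LM. The helper name magnitude_prune_, the uniform per-layer sparsity, and the example checkpoint are illustrative assumptions rather than code from the authors' repository.

```python
# Sketch of unstructured magnitude pruning: zero out the smallest-magnitude
# weights in every linear layer (illustrative; not the authors' released code).
import torch
from transformers import AutoModelForCausalLM

def magnitude_prune_(model: torch.nn.Module, sparsity: float) -> None:
    """Remove the smallest-magnitude fraction of weights in each linear layer, in place."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)        # number of weights to drop in this layer
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_(w.abs() > threshold)          # keep only the larger-magnitude weights

# Example: prune 50% of each linear layer, then run downstream evaluation.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
magnitude_prune_(model, sparsity=0.5)
```

Under the Junk DNA Hypothesis, sweeping the sparsity upward should degrade harder downstream tasks first and most, and that degradation should persist even through downstream continual training.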
Related papers
- Less is More: On the Feature Redundancy of Pretrained Models When
Transferring to Few-shot Tasks [120.23328563831704]
Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data.
We show that, for linear probing, the pretrained features can be extremely redundant when the downstream data is scarce.
arXiv Detail & Related papers (2023-10-05T19:00:49Z) - Exploring Weight Balancing on Long-Tailed Recognition Problem [32.01426831450348]
Recognition problems in long-tailed data, in which the sample size per class is heavily skewed, have gained importance.
Weight balancing, which combines classical regularization techniques with two-stage training, has been proposed.
We analyze weight balancing by focusing on neural collapse and the cone effect at each training stage.
arXiv Detail & Related papers (2023-05-26T01:45:19Z) - Understanding Difficulty-based Sample Weighting with a Universal
Difficulty Measure [2.7413469516930578]
A large number of weighting methods essentially utilize the learning difficulty of training samples to calculate their weights.
The learning difficulties of the samples are determined by multiple factors including noise level, imbalance degree, margin, and uncertainty.
In this study, we theoretically prove that the generalization error of a sample can be used as a universal difficulty measure.
arXiv Detail & Related papers (2023-01-12T07:28:32Z) - Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
arXiv Detail & Related papers (2022-07-22T17:52:30Z) - PLATON: Pruning Large Transformer Models with Upper Confidence Bound of
Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores via an upper confidence bound (UCB) on the importance estimates.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z) - Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, in which a teacher model generates hard pseudo-labels on unlabeled data as supervisory signals.
We analyze the challenges these methods meet with the empirical experiment results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep
Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - Instance-Level Task Parameters: A Robust Multi-task Weighting Framework [17.639472693362926]
Recent works have shown that deep neural networks benefit from multi-task learning by learning a shared representation across several related tasks.
We let the training process dictate the optimal weighting of tasks for every instance in the dataset.
We conduct extensive experiments on the SURREAL and CityScapes datasets for human shape and pose estimation, depth estimation, and semantic segmentation.
arXiv Detail & Related papers (2021-06-11T02:35:42Z) - Multi-Loss Weighting with Coefficient of Variations [19.37721431024278]
We propose a weighting scheme based on the coefficient of variations and set the weights based on properties observed while training the model.
The proposed method incorporates a measure of uncertainty to balance the losses, and as a result the loss weights evolve during training without requiring another (learning based) optimisation.
The validity of the approach is shown empirically for depth estimation and semantic segmentation on multiple datasets; a rough sketch of the weighting idea appears after this list.
arXiv Detail & Related papers (2020-09-03T14:51:19Z)
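As a rough illustration of the coefficient-of-variation loss weighting summarized in the last entry above, the sketch below weights each task loss by the ratio of the running standard deviation to the running mean of its recent history. This follows the general idea only and is not guaranteed to match that paper's exact formulation; the class name CoVLossWeighter and the window size are assumptions.

```python
# Sketch: weight each task loss by the coefficient of variation (std / mean)
# of its recent history, so noisier losses receive larger relative weights.
from collections import deque
import statistics

class CoVLossWeighter:
    def __init__(self, num_losses: int, window: int = 100):
        # One rolling history of scalar loss values per task.
        self.histories = [deque(maxlen=window) for _ in range(num_losses)]

    def combine(self, losses):
        """Return a single training loss from a list of per-task loss values."""
        weights = []
        for history, loss in zip(self.histories, losses):
            history.append(float(loss))                  # record the scalar value only
            mean = statistics.fmean(history)
            std = statistics.pstdev(history) if len(history) > 1 else 0.0
            weights.append(std / mean if mean > 0 else 0.0)
        total = sum(weights)
        if total == 0.0:
            weights = [1.0 / len(losses)] * len(losses)  # uniform fallback early in training
        else:
            weights = [w / total for w in weights]
        return sum(w * l for w, l in zip(weights, losses))
```

In a multi-task training loop, combine would be called on the per-task losses at each step to produce the scalar loss that is backpropagated.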