Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs
"Difficult" Downstream Tasks in LLMs
- URL: http://arxiv.org/abs/2310.02277v2
- Date: Fri, 16 Feb 2024 21:10:12 GMT
- Title: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs
"Difficult" Downstream Tasks in LLMs
- Authors: Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, Zhangyang Wang
- Abstract summary: It has been believed that weights in large language models (LLMs) contain significant redundancy.
This paper presents a counter-argument: the small-magnitude weights of pre-trained models encode vital knowledge essential for tackling difficult downstream tasks.
- Score: 71.56345106591789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Junk DNA Hypothesis by adopting a novel task-centric angle for the
pre-trained weights of large language models (LLMs). It has been believed that
weights in LLMs contain significant redundancy, leading to the conception that
a considerable chunk of the parameters can be removed by pruning without
compromising performance. Contrary to this belief, this paper presents a
counter-argument: the small-magnitude weights of pre-trained models encode
vital knowledge essential for tackling difficult downstream tasks. This
manifests as a monotonic relationship: as we prune more pre-trained weights
by magnitude, the performance drop on downstream tasks grows with task
difficulty. Moreover, we reveal that removing these seemingly inconsequential
weights can result in an irreparable loss of knowledge and in performance
degradation on difficult tasks, even when downstream continual training is
allowed. Interestingly, our evaluations show that the other popular
compression technique, quantization, fails to exhibit a similar monotonic
effect and does not disentangle this task-difficulty information as
convincingly. To study this formally,
we introduce several quantifiable metrics to gauge the downstream task
difficulty: (1) within the same task category, and (2) across different task
categories. Our extensive experiments substantiate the Junk DNA Hypothesis
across a diverse range of model sizes, tasks, datasets, and even pruning
methods. Code is available at:
https://github.com/VITA-Group/Junk_DNA_Hypothesis.git.
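To make the pruning operation studied in the abstract concrete, here is a minimal sketch of layer-wise unstructured magnitude pruning for a Hugging Face causal LM. The helper name magnitude_prune_, the uniform per-layer sparsity, and the example checkpoint are illustrative assumptions rather than code from the authors' repository.

```python
# Sketch of unstructured magnitude pruning: zero out the smallest-magnitude
# weights in every linear layer (illustrative; not the authors' released code).
import torch
from transformers import AutoModelForCausalLM

def magnitude_prune_(model: torch.nn.Module, sparsity: float) -> None:
    """Remove the smallest-magnitude fraction of weights in each linear layer, in place."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)        # number of weights to drop in this layer
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_(w.abs() > threshold)          # keep only the larger-magnitude weights

# Example: prune 50% of each linear layer, then run downstream evaluation.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
magnitude_prune_(model, sparsity=0.5)
```

Under the Junk DNA Hypothesis, sweeping the sparsity upward should degrade harder downstream tasks first and most, and that degradation should persist even through downstream continual training.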
Related papers
- Less is More: On the Feature Redundancy of Pretrained Models When
Transferring to Few-shot Tasks [120.23328563831704]
Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data.
We show that, for linear probing, the pretrained features can be extremely redundant when the downstream data is scarce.
arXiv Detail & Related papers (2023-10-05T19:00:49Z) - Exploring Weight Balancing on Long-Tailed Recognition Problem [32.01426831450348]
Recognition problems in long-tailed data, in which the sample size per class is heavily skewed, have gained importance.
Weight balancing, which combines classical regularization techniques with two-stage training, has been proposed.
We analyze weight balancing by focusing on neural collapse and the cone effect at each training stage.
arXiv Detail & Related papers (2023-05-26T01:45:19Z) - Understanding Difficulty-based Sample Weighting with a Universal
Difficulty Measure [2.7413469516930578]
A large number of weighting methods essentially utilize the learning difficulty of training samples to calculate their weights.
The learning difficulties of the samples are determined by multiple factors including noise level, imbalance degree, margin, and uncertainty.
In this study, we theoretically prove that the generalization error of a sample can be used as a universal difficulty measure.
arXiv Detail & Related papers (2023-01-12T07:28:32Z) - Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
arXiv Detail & Related papers (2022-07-22T17:52:30Z) - PLATON: Pruning Large Transformer Models with Upper Confidence Bound of
Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores via an upper confidence bound (UCB) on the importance estimates.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
arXiv Detail & Related papers (2022-06-25T05:38:39Z) - Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, in which a teacher model generates hard pseudo-labels on unlabeled data as supervisory signals.
We analyze the challenges these methods meet with the empirical experiment results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep
Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - Instance-Level Task Parameters: A Robust Multi-task Weighting Framework [17.639472693362926]
Recent works have shown that deep neural networks benefit from multi-task learning by learning a shared representation across several related tasks.
We let the training process dictate the optimal weighting of tasks for every instance in the dataset.
We conduct extensive experiments on the SURREAL and CityScapes datasets for human shape and pose estimation, depth estimation, and semantic segmentation.
arXiv Detail & Related papers (2021-06-11T02:35:42Z) - Multi-Loss Weighting with Coefficient of Variations [19.37721431024278]
We propose a weighting scheme based on the coefficient of variations and set the weights based on properties observed while training the model.
The proposed method incorporates a measure of uncertainty to balance the losses, and as a result the loss weights evolve during training without requiring another (learning based) optimisation.
The validity of the approach is shown empirically for depth estimation and semantic segmentation on multiple datasets; a rough sketch of the weighting idea appears after this list.
arXiv Detail & Related papers (2020-09-03T14:51:19Z)
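As a rough illustration of the coefficient-of-variation loss weighting summarized in the last entry above, the sketch below weights each task loss by the ratio of the running standard deviation to the running mean of its recent history. This follows the general idea only and is not guaranteed to match that paper's exact formulation; the class name CoVLossWeighter and the window size are assumptions.

```python
# Sketch: weight each task loss by the coefficient of variation (std / mean)
# of its recent history, so noisier losses receive larger relative weights.
from collections import deque
import statistics

class CoVLossWeighter:
    def __init__(self, num_losses: int, window: int = 100):
        # One rolling history of scalar loss values per task.
        self.histories = [deque(maxlen=window) for _ in range(num_losses)]

    def combine(self, losses):
        """Return a single training loss from a list of per-task loss values."""
        weights = []
        for history, loss in zip(self.histories, losses):
            history.append(float(loss))                  # record the scalar value only
            mean = statistics.fmean(history)
            std = statistics.pstdev(history) if len(history) > 1 else 0.0
            weights.append(std / mean if mean > 0 else 0.0)
        total = sum(weights)
        if total == 0.0:
            weights = [1.0 / len(losses)] * len(losses)  # uniform fallback early in training
        else:
            weights = [w / total for w in weights]
        return sum(w * l for w, l in zip(weights, losses))
```

In a multi-task training loop, combine would be called on the per-task losses at each step to produce the scalar loss that is backpropagated.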