Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net
Estimation and Optimization
- URL: http://arxiv.org/abs/2305.14760v2
- Date: Sun, 22 Oct 2023 11:46:33 GMT
- Title: Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net
Estimation and Optimization
- Authors: Shoujie Tong, Heming Xia, Damai Dai, Runxin Xu, Tianyu Liu, Binghuai
Lin, Yunbo Cao, Zhifang Sui
- Abstract summary: Bi-Drop is a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets.
Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods.
- Score: 58.90989478049686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models have achieved remarkable success in natural
language understanding. However, fine-tuning pretrained models on limited
training data tends to overfit and thus diminish performance. This paper
presents Bi-Drop, a fine-tuning strategy that selectively updates model
parameters using gradients from various sub-nets dynamically generated by
dropout. Bi-Drop performs sub-net estimation in an in-batch manner, which
avoids the hysteresis in sub-net updating that affects previous methods
relying on asynchronous sub-net estimation. Moreover, Bi-Drop needs only one
mini-batch to estimate the sub-net, so it makes better use of the training
data. Experiments on the GLUE benchmark demonstrate
that Bi-Drop consistently outperforms previous fine-tuning methods.
Furthermore, empirical results also show that Bi-Drop exhibits excellent
generalization ability and robustness for domain transfer, data imbalance, and
low-resource scenarios.
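The abstract describes the mechanism only at a high level; the following is a minimal PyTorch-style sketch of that general idea (several dropout-generated sub-nets estimated on a single mini-batch, followed by a selective parameter update). The sign-agreement selection rule, the function name `bi_drop_style_step`, and the hyperparameters are illustrative assumptions, not the paper's exact algorithm.
```python
# Sketch: estimate sub-nets from one mini-batch via multiple dropout passes,
# then update only the parameters whose gradients the sub-nets agree on.
import torch

def bi_drop_style_step(model, loss_fn, batch, optimizer, num_subnets=2):
    """One update using gradients from several dropout-generated sub-nets."""
    model.train()  # keep dropout active so each pass samples a different sub-net
    per_subnet_grads = {name: [] for name, p in model.named_parameters() if p.requires_grad}

    for _ in range(num_subnets):
        optimizer.zero_grad()
        loss = loss_fn(model(batch["inputs"]), batch["labels"])
        loss.backward()
        for name, p in model.named_parameters():
            if p.requires_grad and p.grad is not None:
                per_subnet_grads[name].append(p.grad.detach().clone())

    optimizer.zero_grad()
    for name, p in model.named_parameters():
        grads = per_subnet_grads.get(name)
        if not grads:
            continue
        stacked = torch.stack(grads)        # [num_subnets, *param_shape]
        mean_grad = stacked.mean(dim=0)
        # Illustrative selection rule (an assumption): keep only entries whose
        # gradient sign agrees across all sampled sub-nets.
        agree = (stacked.sign() == stacked.sign()[0]).all(dim=0).float()
        p.grad = mean_grad * agree          # selective parameter update
    optimizer.step()
```
Note that each extra sub-net pass adds one forward/backward on the same mini-batch, so the number of sub-nets trades compute for a more reliable in-batch estimate.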
Related papers
- A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning.
Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation.
We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter-efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- End-to-End Weak Supervision [15.125993628007972]
We propose an end-to-end approach for directly learning the downstream model.
We show improved end-model performance on downstream test sets compared with prior work.
arXiv Detail & Related papers (2021-07-05T19:10:11Z)
- Bi-tuning of Pre-trained Representations [79.58542780707441]
Bi-tuning is a general learning framework to fine-tune both supervised and unsupervised pre-trained representations to downstream tasks.
Bi-tuning generalizes vanilla fine-tuning by integrating two heads on top of the backbone of pre-trained representations.
Bi-tuning achieves state-of-the-art results by large margins when fine-tuning both supervised and unsupervised pre-trained models.
arXiv Detail & Related papers (2020-11-12T03:32:25Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting is ubiquitous in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free and easily implemented distribution with a parametric prior, and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z) - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and
Strong Baselines [31.807628937487927]
Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks.
Previous literature identified two potential reasons for the observed instability: catastrophic forgetting and the small size of the fine-tuning datasets.
We show that both hypotheses fail to explain the fine-tuning instability.
arXiv Detail & Related papers (2020-06-08T19:06:24Z) - Learnable Bernoulli Dropout for Bayesian Deep Learning [53.79615543862426]
Learnable Bernoulli dropout (LBD) is a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters.
LBD leads to improved accuracy and uncertainty estimates in image classification and semantic segmentation.
arXiv Detail & Related papers (2020-02-12T18:57:14Z)
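To make the Learnable Bernoulli Dropout entry above more concrete, here is a minimal sketch of one common way to make a dropout rate trainable: a relaxed (concrete) Bernoulli mask lets the keep probability receive gradients and be optimized jointly with the model weights. The module name, initialization, and temperature are illustrative assumptions, not the paper's exact formulation.
```python
# Sketch: dropout with a learnable keep probability, trained jointly with
# the rest of the model via a relaxed (concrete) Bernoulli mask.
import torch
import torch.nn as nn

class LearnableDropout(nn.Module):
    def __init__(self, init_keep_prob=0.9, temperature=0.1):
        super().__init__()
        # Parameterize the keep probability with a logit so it stays in (0, 1).
        self.keep_logit = nn.Parameter(torch.logit(torch.tensor(init_keep_prob)))
        self.temperature = temperature

    def forward(self, x):
        keep_prob = torch.sigmoid(self.keep_logit)
        if not self.training:
            return x  # eval uses the full layer; training rescales by keep_prob below
        # Relaxed Bernoulli sample: differentiable w.r.t. keep_logit.
        u = torch.rand_like(x).clamp(1e-6, 1 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        mask = torch.sigmoid((self.keep_logit + logistic_noise) / self.temperature)
        return x * mask / keep_prob  # inverted-dropout style rescaling
```
In practice a regularizer or prior on the keep probability is usually added so the learned rate does not simply collapse toward keeping everything.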