Noise Stability Regularization for Improving BERT Fine-tuning
- URL: http://arxiv.org/abs/2107.04835v1
- Date: Sat, 10 Jul 2021 13:19:04 GMT
- Title: Noise Stability Regularization for Improving BERT Fine-tuning
- Authors: Hang Hua, Xingjian Li, Dejing Dou, Cheng-Zhong Xu, Jiebo Luo
- Abstract summary: Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP tasks.
We introduce a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR).
We experimentally confirm that well-performing models show a low sensitivity to noise and fine-tuning with LNSR exhibits clearly higher generalizability and stability.
- Score: 94.80511419444723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pre-trained language models such as BERT has become a common
practice dominating leaderboards across various NLP tasks. Despite its recent
success and wide adoption, this process is unstable when there are only a small
number of training samples available. The brittleness of this process is often
reflected by the sensitivity to random seeds. In this paper, we propose to
tackle this problem based on the noise stability property of deep nets, which
is investigated in recent literature (Arora et al., 2018; Sanyal et al., 2020).
Specifically, we introduce a novel and effective regularization method to
improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability
Regularization (LNSR). We extend existing theory on adding noise to the input
and prove that our method yields a more stable regularization effect. We provide
supportive evidence by experimentally confirming that well-performing models
show a low sensitivity to noise and fine-tuning with LNSR exhibits clearly
higher generalizability and stability. Furthermore, our method also
demonstrates advantages over other state-of-the-art algorithms including L2-SP
(Li et al., 2018), Mixout (Lee et al., 2020) and SMART (Jiang et al., 2020).
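The abstract describes LNSR only at a high level, so below is a minimal PyTorch sketch of the layer-wise noise stability idea as stated: Gaussian noise is injected into one hidden layer's output, and the drift it induces in the outputs of the subsequent layers is penalized alongside the task loss. The toy encoder, injection layer, noise scale `sigma`, and regularization weight are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Small stand-in for a BERT-style encoder stack."""

    def __init__(self, d_model=128, nhead=4, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, dropout=0.0, batch_first=True)
            for _ in range(num_layers)
        )

    def forward_from(self, hidden, start=0):
        """Run layers `start`..end and return every intermediate output."""
        states = []
        for layer in self.layers[start:]:
            hidden = layer(hidden)
            states.append(hidden)
        return states


def lnsr_penalty(encoder, embeddings, inject_at=1, sigma=0.01):
    """Penalize how much Gaussian noise injected at layer `inject_at`
    perturbs the outputs of all subsequent layers (layer-wise noise stability)."""
    clean_states = encoder.forward_from(embeddings)          # clean pass, all layer outputs
    h = clean_states[inject_at]
    noisy_states = encoder.forward_from(                     # re-run upper layers on noisy input
        h + sigma * torch.randn_like(h), start=inject_at + 1
    )
    penalty = sum(
        (c - n).pow(2).mean()
        for c, n in zip(clean_states[inject_at + 1:], noisy_states)
    )
    return penalty, clean_states[-1]


# Usage: add the stability penalty to an ordinary classification loss.
encoder, classifier = ToyEncoder(), nn.Linear(128, 2)
x = torch.randn(8, 16, 128)                  # (batch, seq_len, d_model) embeddings
labels = torch.randint(0, 2, (8,))
penalty, final_state = lnsr_penalty(encoder, x)
logits = classifier(final_state.mean(dim=1))  # mean pooling over tokens
loss = nn.functional.cross_entropy(logits, labels) + 1.0 * penalty
loss.backward()
```

In an actual fine-tuning run the penalty would be computed on BERT's encoder layers; the toy encoder above only keeps the sketch self-contained and runnable.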
Related papers
- Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation [91.83820250747935]
Pseudo-label noise is mainly contained in unstable samples in which predictions of most pixels undergo significant variations during self-training.
We introduce the Stable Neighbor Denoising (SND) approach, which effectively discovers highly correlated stable and unstable samples.
SND consistently outperforms state-of-the-art methods in various SFUDA semantic segmentation settings.
arXiv Detail & Related papers (2024-06-10T21:44:52Z)
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, avoiding the arbitrary tuning from a mini-batch of samples used in previous methods.
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
- Identifying Hard Noise in Long-Tailed Sample Distribution [76.16113794808001]
We introduce Noisy Long-Tailed Classification (NLT).
Most de-noising methods fail to identify the hard noises.
We design an iterative noisy learning framework called Hard-to-Easy (H2E).
arXiv Detail & Related papers (2022-07-27T09:03:03Z)
- Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for methods for solving monotone and structured non-monotone VIPs.
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
- Square Root Principal Component Pursuit: Tuning-Free Noisy Robust Matrix Recovery [8.581512812219737]
We propose a new framework for low-rank matrix recovery from observations corrupted with noise and outliers.
Inspired by the square root Lasso, this new formulation does not require prior knowledge of the noise level.
We show that a single, universal choice of the regularization parameter suffices to achieve reconstruction error proportional to the (a priori unknown) noise level.
arXiv Detail & Related papers (2021-06-17T02:28:11Z)
- Noisy Recurrent Neural Networks [45.94390701863504]
We study recurrent neural networks (RNNs) trained by injecting noise into hidden states as discretizations of differential equations driven by input data.
We find that, under reasonable assumptions, this implicit regularization promotes flatter minima; it biases towards models with more stable dynamics; and, in classification tasks, it favors models with larger classification margin.
Our theory is supported by empirical results which demonstrate improved robustness with respect to various input perturbations, while maintaining state-of-the-art performance (a minimal sketch of this hidden-state noise injection appears after this list).
arXiv Detail & Related papers (2021-02-09T15:20:50Z)
- On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines [31.807628937487927]
Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks.
Previous literature identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets.
We show that both hypotheses fail to explain the fine-tuning instability.
arXiv Detail & Related papers (2020-06-08T19:06:24Z)
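As referenced in the Noisy Recurrent Neural Networks entry above, here is a minimal sketch of noise injection into RNN hidden states, written as an Euler-Maruyama-style update in which a deterministic drift is perturbed by Gaussian noise during training. The cell parameterization, step size `dt`, and noise scale `sigma` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


class NoisyRNNCell(nn.Module):
    """RNN cell whose hidden-state update discretizes a noise-driven ODE/SDE."""

    def __init__(self, input_size, hidden_size, dt=0.1, sigma=0.05):
        super().__init__()
        self.drift = nn.Linear(input_size + hidden_size, hidden_size)
        self.dt, self.sigma = dt, sigma

    def forward(self, x_t, h):
        # Deterministic drift term f(h, x_t).
        f = torch.tanh(self.drift(torch.cat([x_t, h], dim=-1)))
        # Diffusion term: Gaussian noise scaled by sqrt(dt), only at train time.
        noise = torch.randn_like(h) * self.sigma * self.dt ** 0.5 if self.training else 0.0
        return h + self.dt * f + noise


# Usage: unroll over a sequence of shape (batch, time, features).
cell = NoisyRNNCell(input_size=8, hidden_size=32)
x = torch.randn(4, 20, 8)
h = torch.zeros(4, 32)
for t in range(x.size(1)):
    h = cell(x[:, t], h)
print(h.shape)  # torch.Size([4, 32])
```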