IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization
- URL: http://arxiv.org/abs/2005.02178v2
- Date: Thu, 4 Feb 2021 01:40:26 GMT
- Title: IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization
- Authors: Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren
- Abstract summary: Fine-tuning pre-trained language models (PTLMs) has been a common practice for advancing performance in natural language understanding (NLU) tasks.
Recent advances in representation learning show that isotropic embeddings can significantly improve performance on downstream tasks with faster convergence and better generalization.
We analyze the isotropy of the pre-trained embeddings in PTLMs with straightforward visualization, and point out two major issues: high variance across their per-dimension standard deviations, and high correlation between different dimensions.
- Score: 41.267328947683936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning pre-trained language models (PTLMs), such as BERT and its
improved variant RoBERTa, has been a common practice for advancing performance in
natural language understanding (NLU) tasks. Recent advances in representation
learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings
can significantly improve performance on downstream tasks with faster
convergence and better generalization. The isotropy of the pre-trained
embeddings in PTLMs, however, is relatively under-explored. In this paper, we
analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with
straightforward visualization, and point out two major issues: high variance
across their per-dimension standard deviations, and high correlation between
different dimensions. We also propose a new network regularization method,
isotropic batch normalization (IsoBN), to address these issues and learn more
isotropic representations during fine-tuning by dynamically penalizing the
dominating principal components. This simple yet effective fine-tuning method
yields an absolute improvement of about 1.0 point on the average score of
seven NLU tasks.
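The two diagnostics named in the abstract are easy to reproduce. Below is a minimal sketch (the function name and the principal-component readout are ours, not the paper's) that measures the spread of per-dimension standard deviations, the mean inter-dimension correlation, and the variance share of the top principal component for a batch of [CLS] embeddings:

```python
import torch

def isotropy_diagnostics(cls_embeddings: torch.Tensor):
    """Diagnose the two issues named in the abstract for a batch of
    [CLS] embeddings of shape (batch, hidden). Illustrative helper,
    not code from the IsoBN paper."""
    x = cls_embeddings - cls_embeddings.mean(dim=0, keepdim=True)
    std = x.std(dim=0)
    # Issue 1: high variance in the per-dimension standard deviations,
    # summarized as their coefficient of variation.
    std_cv = (std.std() / std.mean()).item()
    # Issue 2: high correlation between different dimensions,
    # summarized as the mean absolute off-diagonal correlation.
    corr = torch.corrcoef(x.T)
    n = corr.shape[0]
    mean_abs_corr = ((corr.abs().sum() - n) / (n * n - n)).item()
    # Dominating principal components: variance share of the top direction.
    cov = x.T @ x / (x.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov)  # ascending order
    top_pc_share = (eigvals[-1] / eigvals.sum()).item()
    return std_cv, mean_abs_corr, top_pc_share

# A random Gaussian batch is near-isotropic, so all three numbers are small.
print(isotropy_diagnostics(torch.randn(128, 768)))
```

IsoBN itself goes further: during fine-tuning it dynamically penalizes the dominating principal components that the last diagnostic exposes.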
Related papers
- InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning [35.89674702985539]
InstructDiff is a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion.
We show that InstructDiff achieves a 17% relative improvement over full-data training on mathematical reasoning and a 52% relative improvement on general instruction-following.
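The blurb does not say how the differential entropy is estimated. One standard estimator fits a Gaussian to an example's hidden states and uses the closed form H = 0.5 * (d * log(2*pi*e) + log det Sigma); the sketch below is our illustration of that estimator, not the paper's actual pipeline:

```python
import numpy as np

def gaussian_differential_entropy(hidden_states: np.ndarray,
                                  eps: float = 1e-6) -> float:
    """Differential entropy of a Gaussian fit to token hidden states of
    shape (tokens, dim): H = 0.5 * (d * log(2*pi*e) + log det Sigma).
    Purely illustrative; not taken from the InstructDiff paper."""
    d = hidden_states.shape[1]
    cov = np.cov(hidden_states, rowvar=False) + eps * np.eye(d)  # regularized
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

# Hypothetical use: score candidate examples and select by entropy.
scores = [gaussian_differential_entropy(np.random.randn(32, 16))
          for _ in range(4)]
print(sorted(scores))
```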
arXiv Detail & Related papers (2026-01-30T14:15:44Z)
- Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning [104.27224674122313]
Fine-tuning MLLM has become a common practice to improve performance on specific downstream tasks.
To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions.
arXiv Detail & Related papers (2024-11-17T01:16:37Z)
- PACE: marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization [35.922096876707975]
PACE marries generalization in PArameter-efficient fine-tuning with Consistency rEgularization.
We show that PACE not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge.
PACE outperforms existing PEFT methods in four visual adaptation tasks: VTAB-1k, FGVC, few-shot learning, and domain adaptation.
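As a rough sketch of consistency regularization in general (the noise source and loss weight are our assumptions; PACE's exact perturbation differs), one can run the same input through the model twice with stochastic noise and penalize the disagreement:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model: torch.nn.Module, x: torch.Tensor,
                     lam: float = 0.1) -> torch.Tensor:
    """Generic consistency regularization: two stochastic forward passes
    (here, active dropout) and an MSE penalty on their disagreement.
    Illustrative only, not PACE's exact formulation."""
    model.train()  # keep dropout active for both passes
    out1, out2 = model(x), model(x)
    return lam * F.mse_loss(out1, out2)

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.Dropout(0.1),
    torch.nn.ReLU(), torch.nn.Linear(32, 4))
loss = consistency_loss(model, torch.randn(8, 16))
loss.backward()  # added to the task loss in practice
print(loss.item())
```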
arXiv Detail & Related papers (2024-09-25T17:56:00Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights with enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of zero-shot generalization of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in few-shot image classification scenarios.
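Orthogonal fine-tuning in general can be illustrated by rotating a frozen pre-trained weight with a learned orthogonal matrix. The Cayley-transform parameterization below is a generic sketch of that idea, not OrthSR's specific design:

```python
import torch
import torch.nn as nn

class OrthogonalFineTune(nn.Module):
    """Rotate a frozen pre-trained weight W by a learned orthogonal R:
    W' = (I + A)^-1 (I - A) W, which is exactly orthogonal when A is
    skew-symmetric (Cayley transform). Generic sketch, not OrthSR."""
    def __init__(self, pretrained_weight: torch.Tensor):
        super().__init__()
        d_out = pretrained_weight.shape[0]
        self.weight = nn.Parameter(pretrained_weight, requires_grad=False)
        self.skew = nn.Parameter(torch.zeros(d_out, d_out))  # only this trains

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.skew - self.skew.T                 # enforce skew-symmetry
        eye = torch.eye(a.shape[0], device=a.device)
        rot = torch.linalg.solve(eye + a, eye - a)  # Cayley transform
        return x @ (rot @ self.weight).T

layer = OrthogonalFineTune(torch.randn(32, 16))
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 32])
```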
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views [28.081794908107604]
Fine-tuning is used to leverage the power of pre-trained foundation models in new downstream tasks.
Recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions.
We propose a novel generalizable fine-tuning method LEVI, where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model.
arXiv Detail & Related papers (2024-02-07T08:16:40Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE benchmark and instruction tuning.
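A minimal sketch of gradient-based sparse updating (the per-tensor top-k masking rule is our assumption, not necessarily SIFT's exact criterion):

```python
import torch

def sparse_grad_step(model: torch.nn.Module,
                     optimizer: torch.optim.Optimizer,
                     density: float = 0.01) -> None:
    """Zero all but (roughly) the largest-magnitude `density` fraction of
    each parameter's gradient before stepping. Generic sketch of sparse
    fine-tuning, not SIFT's exact algorithm."""
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(density * g.numel()))
        thresh = g.topk(k).values.min()  # k-th largest magnitude
        p.grad.mul_((p.grad.abs() >= thresh).float())
    optimizer.step()
    optimizer.zero_grad()

model = torch.nn.Linear(64, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
model(torch.randn(8, 64)).pow(2).mean().backward()
sparse_grad_step(model, opt)
```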
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Stable Anisotropic Regularization [18.52015282224059]
We propose I-STAR: IsoScore*-based STable Anisotropic Regularization, a novel regularization method that can be used to increase or decrease levels of isotropy in embedding space during training.
I-STAR uses IsoScore*, the first accurate measure of isotropy that is both differentiable and stable on mini-batch computations.
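IsoScore* itself is defined in the I-STAR paper; as a hedged stand-in, the sketch below attaches a simpler differentiable isotropy proxy (normalized eigenvalue entropy, our choice) to the training loss:

```python
import torch

def isotropy_entropy(embeddings: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    """Differentiable isotropy proxy: entropy of the normalized covariance
    eigenvalues, equal to 1.0 when every direction carries the same
    variance. A simpler stand-in for IsoScore*, not its definition."""
    x = embeddings - embeddings.mean(dim=0, keepdim=True)
    cov = x.T @ x / (x.shape[0] - 1) + eps * torch.eye(x.shape[1])
    lam = torch.linalg.eigvalsh(cov)
    p = lam / lam.sum()
    return -(p * p.log()).sum() / torch.log(torch.tensor(float(len(p))))

# Regularize toward higher isotropy: total = task_loss - weight * score.
emb = torch.randn(64, 32, requires_grad=True)
score = isotropy_entropy(emb)
(-0.1 * score).backward()  # gradient ascent on isotropy
print(score.item())        # near 1.0 for random Gaussian data
```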
arXiv Detail & Related papers (2023-05-30T18:57:45Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
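The summary gives only the name. As a loose reading of a "two-wing" normalization (our illustration, not necessarily TWINS' actual design), a layer can combine a batch-norm wing frozen at pre-trained statistics with one that keeps adapting to downstream data:

```python
import torch
import torch.nn as nn

class TwoWingNorm(nn.Module):
    """Two batch-norm 'wings': one frozen at (pre-trained) running
    statistics, one updating on downstream batches; outputs averaged.
    Loose illustration of a two-wing idea, not TWINS itself."""
    def __init__(self, num_features: int):
        super().__init__()
        self.frozen = nn.BatchNorm1d(num_features)
        self.adaptive = nn.BatchNorm1d(num_features)
        for p in self.frozen.parameters():
            p.requires_grad = False
        self.frozen.eval()

    def train(self, mode: bool = True):
        super().train(mode)
        self.frozen.eval()  # the frozen wing never updates its statistics
        return self

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.5 * (self.frozen(x) + self.adaptive(x))

layer = TwoWingNorm(16).train()
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```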
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Double Forward Propagation for Memorized Batch Normalization [68.34268180871416]
Batch Normalization (BN) has been a standard component in designing deep neural networks (DNNs).
We propose a memorized batch normalization (MBN) which considers multiple recent batches to obtain more accurate and robust statistics.
Compared to related methods, the proposed MBN exhibits consistent behaviors in both training and inference.
arXiv Detail & Related papers (2020-10-10T08:48:41Z)
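A minimal sketch of the memorized-statistics idea (memory size and plain averaging are our simplifications; the paper's double forward propagation, which keeps stale statistics consistent with current parameters, is omitted):

```python
import collections
import torch
import torch.nn as nn

class MemorizedBatchNorm(nn.Module):
    """Normalize with moments averaged over the last `memory` batches
    rather than the current batch alone. Sketch of the summary's idea;
    omits the paper's double forward propagation."""
    def __init__(self, num_features: int, memory: int = 5,
                 eps: float = 1e-5):
        super().__init__()
        self.means = collections.deque(maxlen=memory)
        self.vars = collections.deque(maxlen=memory)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Detach stored moments so the memory holds no computation graphs
        # (a simplification: gradients skip the stored statistics).
        self.means.append(x.mean(dim=0).detach())
        self.vars.append(x.var(dim=0, unbiased=False).detach())
        mean = torch.stack(list(self.means)).mean(dim=0)
        var = torch.stack(list(self.vars)).mean(dim=0)
        return self.weight * (x - mean) / (var + self.eps).sqrt() + self.bias

layer = MemorizedBatchNorm(16)
for _ in range(3):                      # statistics accumulate over batches
    out = layer(torch.randn(8, 16))
print(out.shape)                        # torch.Size([8, 16])
```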
This list is automatically generated from the titles and abstracts of the papers on this site.