Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging
- URL: http://arxiv.org/abs/2212.05956v1
- Date: Mon, 12 Dec 2022 15:09:56 GMT
- Title: Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging
- Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Ahmad Rashid, Ali Ghodsi, Philippe Langlais
- Abstract summary: Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs).
We adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to fine-tune PLMs.
We demonstrate that our adaptation improves the generalization without extra cost.
- Score: 25.856435988848638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) is a commonly used technique for improving the
generalization of compact Pre-trained Language Models (PLMs) on downstream
tasks. However, such methods impose the additional burden of training a
separate teacher model for every new dataset. Alternatively, one may directly
work on the improvement of the optimization procedure of the compact model
toward better generalization. Recent works observe that the flatness of the
local minimum correlates well with better generalization. In this work, we
adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a
flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on
various NLP tasks (text classification, question answering, and generation) and
different model architectures and demonstrate that our adaptation improves the
generalization without extra computational cost. Moreover, we observe that this
simple optimization technique is able to outperform state-of-the-art KD
methods for compact models.
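For illustration, the generic SWA recipe amounts to a short change to the fine-tuning loop: after a number of warm-up epochs, keep a running average of the weights and evaluate the averaged model instead of the final iterate. The minimal sketch below uses PyTorch's built-in torch.optim.swa_utils; the model, data loader, learning rates, and the swa_start point are placeholder assumptions, not the authors' exact configuration.

```python
# Minimal SWA fine-tuning sketch using PyTorch's built-in SWA utilities.
# This follows the generic SWA recipe, not necessarily the paper's exact
# adaptation; model, loader, and hyperparameters are placeholders.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

def finetune_with_swa(model, train_loader, loss_fn, num_epochs=10, swa_start=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    # Running average of the weights; updated only after `swa_start` epochs.
    swa_model = AveragedModel(model)
    # Anneal to a constant learning rate once averaging begins (assumed schedule).
    swa_scheduler = SWALR(optimizer, swa_lr=1e-5)

    for epoch in range(num_epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # accumulate the running average
            swa_scheduler.step()

    # Evaluate the averaged weights, not the last iterate. (For models with
    # BatchNorm, torch.optim.swa_utils.update_bn would be needed here;
    # Transformer PLMs use LayerNorm, so it typically is not.)
    return swa_model
```

The averaged iterates tend to sit in a flatter region of the loss surface than the final iterate, which is the property the paper leverages for better generalization.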
Related papers
- A Post-Training Enhanced Optimization Approach for Small Language Models [0.0]
This paper proposes a continuous post-training alignment data construction method for small language models.
The core of the method uses data guidance from large models to optimize the diversity and accuracy of the alignment data.
arXiv Detail & Related papers (2024-11-05T09:32:26Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of VLMs in terms of zero-shot generalization; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Domain Generalization Guided by Large-Scale Pre-Trained Priors [24.74398777539288]
Domain generalization (DG) aims to train a model from limited source domains, allowing it to generalize to unknown target domains.
We introduce Fine-Tune with Large-scale pre-trained Priors (FT-LP).
FT-LP incorporates the pre-trained model as a prior into the DG fine-tuning process, ensuring that the fine-tuned model refers to its pre-trained counterpart at each optimization step.
arXiv Detail & Related papers (2024-06-09T03:32:32Z)
- MAST: Model-Agnostic Sparsified Training [4.962431253126472]
We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function.
Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators.
We present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation.
arXiv Detail & Related papers (2023-11-27T18:56:03Z)
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for in-batch negatives unless given well-constructed hard negatives (a sketch of an in-batch negatives loss follows after this list).
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
- When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
- A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning [132.45959478064736]
We propose a general framework that unifies model-based and model-free reinforcement learning.
We propose a novel estimation function with decomposable structural properties for optimization-based exploration.
Under our framework, a new sample-efficient algorithm, OPtimization-based ExploRation with Approximation (OPERA), is proposed.
arXiv Detail & Related papers (2022-09-30T17:59:16Z)
- Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality [65.67315418971688]
Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR) are proposed.
Experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization.
arXiv Detail & Related papers (2022-07-05T15:39:29Z)
- Class-Incremental Learning with Strong Pre-trained Models [97.84755144148535]
Class-incremental learning (CIL) has been widely studied under the setting of starting from a small number of classes (base classes).
We explore an understudied real-world setting of CIL that starts with a strong model pre-trained on a large number of base classes.
Our proposed method is robust and generalizes to all analyzed CIL settings.
arXiv Detail & Related papers (2022-04-07T17:58:07Z)
- Adapting by Pruning: A Case Study on BERT [9.963251767416967]
We propose a novel model adaptation paradigm, adapting by pruning, which prunes neural connections in the pre-trained model to optimise the performance on the target task.
We formulate adapting-by-pruning as an optimisation problem with a differentiable loss and propose an efficient algorithm to prune the model.
Results suggest that our method can prune up to 50% of the weights in BERT while yielding performance similar to that of the fine-tuned full model (a generic pruning sketch follows after this list).
arXiv Detail & Related papers (2021-05-07T15:51:08Z)
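For the dense-encoder recipe above, the following is a minimal sketch of the standard in-batch negatives contrastive loss in PyTorch. Treating every other passage in the batch as a negative for each query is the conventional formulation; whether that paper uses this exact temperature-scaled cross-entropy form is an assumption, and all names are placeholders.

```python
# Minimal sketch of a contrastive loss with in-batch negatives for dense
# retrieval. Standard formulation only; the temperature value and function
# names are illustrative assumptions, not taken from the cited paper.
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb, passage_emb, temperature=0.05):
    """query_emb, passage_emb: (batch, dim); row i of each is a positive pair.
    Every other passage in the batch serves as a negative for query i."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    scores = q @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(scores, labels)
```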
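For the adapting-by-pruning entry, here is a minimal sketch of generic unstructured magnitude pruning using torch.nn.utils.prune. That paper formulates pruning as a differentiable optimization problem; the sketch below shows only the simpler magnitude-based variant, at the 50% sparsity level quoted above, applied to all linear layers.

```python
# Minimal sketch of unstructured magnitude pruning on a pre-trained
# Transformer with PyTorch's pruning utilities. Illustrates the generic
# prune-then-fine-tune idea, not the paper's differentiable pruning objective.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_linear_layers(model, amount=0.5):
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Zero out the smallest-magnitude `amount` fraction of weights per layer.
            prune.l1_unstructured(module, name="weight", amount=amount)
            # Fold the binary mask into the weights, making the pruning permanent.
            prune.remove(module, "weight")
    return model
```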
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.