Improving Generalization of Pre-trained Language Models via Stochastic
Weight Averaging
- URL: http://arxiv.org/abs/2212.05956v1
- Date: Mon, 12 Dec 2022 15:09:56 GMT
- Title: Improving Generalization of Pre-trained Language Models via Stochastic
Weight Averaging
- Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Ahmad Rashid, Ali Ghodsi,
Philippe Langlais
- Abstract summary: Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs).
We adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to fine-tuning PLMs.
We demonstrate that our adaptation improves the generalization without extra computation cost.
- Score: 25.856435988848638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) is a commonly used technique for improving the
generalization of compact Pre-trained Language Models (PLMs) on downstream
tasks. However, such methods impose the additional burden of training a
separate teacher model for every new dataset. Alternatively, one may directly
work on the improvement of the optimization procedure of the compact model
toward better generalization. Recent works observe that the flatness of the
local minimum correlates well with better generalization. In this work, we
adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a
flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on
various NLP tasks (text classification, question answering, and generation) and
different model architectures and demonstrate that our adaptation improves the
generalization without extra computation cost. Moreover, we observe that this
simple optimization technique is able to outperform the state-of-the-art KD
methods for compact models.
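
The core idea of SWA mentioned in the abstract, averaging the weights visited late in training instead of keeping only the last iterate, can be illustrated with a minimal sketch. This is a toy linear regression in plain Python, not the authors' adaptation for PLM fine-tuning; the function names, the constant learning rate, and the `swa_start` epoch are all assumptions chosen for illustration.

```python
import random


def make_data(n=50, seed=1):
    """Toy data (hypothetical): y = 2x - 1 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x - 1.0 + rng.gauss(0.0, 0.3)) for x in xs]


def sgd_with_swa(data, lr=0.05, epochs=40, swa_start=20, seed=0):
    """Fit y ~ w*x + b with plain SGD and keep an SWA average of (w, b).

    From epoch `swa_start` onward, the weights at the end of each epoch
    are folded into an equal-weight running mean (the SWA solution).
    """
    rng = random.Random(seed)
    samples = list(data)
    w, b = 0.0, 0.0              # current SGD iterate
    swa_w, swa_b = 0.0, 0.0      # running average of iterates
    n_avg = 0                    # number of iterates averaged so far

    for epoch in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            err = (w * x + b) - y        # gradient of 0.5 * err**2
            w -= lr * err * x
            b -= lr * err
        if epoch >= swa_start:
            n_avg += 1
            # Incremental mean: avg += (new - avg) / n
            swa_w += (w - swa_w) / n_avg
            swa_b += (b - swa_b) / n_avg

    return (w, b), (swa_w, swa_b), n_avg


if __name__ == "__main__":
    last, averaged, n = sgd_with_swa(make_data())
    print("last SGD iterate:", last)
    print("SWA average over", n, "epochs:", averaged)
```

The averaging adds only one extra copy of the weights and a cheap update per epoch, which is consistent with the abstract's claim of no extra computation cost. In PyTorch, `torch.optim.swa_utils.AveragedModel` provides a comparable running average for real models; whether the paper uses it is not stated here.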
Related papers
- Domain Generalization Guided by Large-Scale Pre-Trained Priors [24.74398777539288]
Domain generalization (DG) aims to train a model from limited source domains, allowing it to generalize to unknown target domains.
We introduce Fine-Tune with Large-scale pre-trained Priors (FT-LP).
FT-LP incorporates the pre-trained model as a prior into the DG fine-tuning process, ensuring that the model refers to its pre-trained model at each optimization step.
arXiv Detail & Related papers (2024-06-09T03:32:32Z)
- MAST: Model-Agnostic Sparsified Training [4.962431253126472]
We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function.
Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators.
We present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation.
arXiv Detail & Related papers (2023-11-27T18:56:03Z)
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- Model Agnostic Sample Reweighting for Out-of-Distribution Learning [38.843552982739354]
We propose a principled method, Model Agnostic samPLe rEweighting (MAPLE), to effectively address the OOD problem.
Our key idea is to find an effective reweighting of the training samples so that the standard empirical risk minimization training of a large model leads to superior OOD generalization performance.
arXiv Detail & Related papers (2023-01-24T05:11:03Z)
- When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
- A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning [132.45959478064736]
We propose a general framework that unifies model-based and model-free reinforcement learning.
We propose a novel estimation function with decomposable structural properties for optimization-based exploration.
Under our framework, a new sample-efficient algorithm namely OPtimization-based ExploRation with Approximation (OPERA) is proposed.
arXiv Detail & Related papers (2022-09-30T17:59:16Z)
- Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality [65.67315418971688]
Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR) are proposed.
Experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization.
arXiv Detail & Related papers (2022-07-05T15:39:29Z)
- Class-Incremental Learning with Strong Pre-trained Models [97.84755144148535]
Class-incremental learning (CIL) has been widely studied under the setting of starting from a small number of classes (base classes).
We explore an understudied real-world setting of CIL that starts with a strong model pre-trained on a large number of base classes.
Our proposed method is robust and generalizes to all analyzed CIL settings.
arXiv Detail & Related papers (2022-04-07T17:58:07Z)
- Heterogeneous Calibration: A post-hoc model-agnostic framework for improved generalization [8.815439276597818]
We introduce the notion of heterogeneous calibration that applies a post-hoc model-agnostic transformation to model outputs for improving AUC performance on binary classification tasks.
We refer to simple patterns as heterogeneous partitions of the feature space and show theoretically that perfectly calibrating each partition separately optimizes AUC.
While the theoretical optimality of this framework holds for any model, we focus on deep neural networks (DNNs) and test the simplest instantiation of this paradigm on a variety of open-source datasets.
arXiv Detail & Related papers (2022-02-10T05:08:50Z)
- Adapting by Pruning: A Case Study on BERT [9.963251767416967]
We propose a novel model adaptation paradigm, adapting by pruning, which prunes neural connections in the pre-trained model to optimise the performance on the target task.
We formulate adapting-by-pruning as an optimisation problem with a differentiable loss and propose an efficient algorithm to prune the model.
Results suggest that our method can prune up to 50% of the weights in BERT while yielding performance similar to that of the fine-tuned full model.
arXiv Detail & Related papers (2021-05-07T15:51:08Z)