Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning
- URL: http://arxiv.org/abs/2011.01403v3
- Date: Fri, 2 Apr 2021 20:27:44 GMT
- Title: Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning
- Authors: Beliz Gunel, Jingfei Du, Alexis Conneau, Ves Stoyanov
- Abstract summary: State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model and then fine-tuning it on task-specific labeled data.
We propose a supervised contrastive learning (SCL) objective for the fine-tuning stage.
Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data.
- Score: 23.00300794016583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art natural language understanding classification models follow
two stages: pre-training a large language model on an auxiliary task, and then
fine-tuning the model on a task-specific labeled dataset using cross-entropy
loss. However, the cross-entropy loss has several shortcomings that can lead to
sub-optimal generalization and instability. Driven by the intuition that good
generalization requires capturing the similarity between examples in one class
and contrasting them with examples in other classes, we propose a supervised
contrastive learning (SCL) objective for the fine-tuning stage. Combined with
cross-entropy, our proposed SCL loss obtains significant improvements over a
strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in
few-shot learning settings, without requiring specialized architecture, data
augmentations, memory banks, or additional unsupervised data. Our proposed
fine-tuning objective leads to models that are more robust to different levels
of noise in the fine-tuning training data, and can generalize better to related
tasks with limited labeled data.
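
As a rough illustration of the proposed objective, the sketch below combines a standard cross-entropy term with a batch-wise supervised contrastive term over the encoder's sentence embeddings. The temperature `tau`, the weighting `lam`, and the helper names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, tau=0.3):
    """Batch-wise supervised contrastive term: examples that share a label
    act as positives for each other. Temperature and normalization details
    here are illustrative assumptions."""
    z = F.normalize(embeddings, dim=1)                        # (B, d) unit vectors
    sim = z @ z.t() / tau                                     # pairwise similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
    exp_sim = sim.exp().masked_fill(self_mask, 0.0)           # exclude self-pairs
    log_prob = sim - exp_sim.sum(dim=1, keepdim=True).log()   # log-softmax over the batch
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss_per_anchor[pos_mask.any(dim=1)].mean()        # anchors with >= 1 positive

def joint_fine_tuning_loss(logits, embeddings, labels, lam=0.9):
    """Convex combination of cross-entropy and SCL (the weighting is an assumption)."""
    ce = F.cross_entropy(logits, labels)
    scl = supervised_contrastive_loss(embeddings, labels)
    return (1.0 - lam) * ce + lam * scl
```

In a fine-tuning setup of the kind the abstract describes, the embeddings would be the pooled sentence representations produced by the encoder (e.g. RoBERTa-Large), with the same mini-batch feeding both loss terms.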
Related papers
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in modeling the distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model this latent correspondence.
A bilevel learning framework is constructed to endow the encoder with scene-irrelevant generality across diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Semi-Supervised Learning Based on Reference Model for Low-resource TTS [32.731900584216724]
We propose a semi-supervised learning method for neural TTS in which labeled target data is limited.
Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves voice quality on test data, achieving naturalness and robustness in speech synthesis.
arXiv Detail & Related papers (2022-10-25T07:48:07Z)
- Few-shot Text Classification with Dual Contrastive Consistency [31.141350717029358]
In this paper, we explore how to utilize a pre-trained language model for few-shot text classification.
We adopt supervised contrastive learning on the small amount of labeled data and consistency regularization on the large amount of unlabeled data.
arXiv Detail & Related papers (2022-09-29T19:26:23Z)
- ScatSimCLR: self-supervised contrastive learning with pretext task regularization for small-scale datasets [5.2424255020469595]
We consider the problem of self-supervised learning for small-scale datasets based on a contrastive loss between multiple views of the data.
We argue that the number of parameters of the whole system and the number of views can be considerably reduced while preserving the same classification accuracy.
arXiv Detail & Related papers (2021-08-31T15:58:45Z)
- Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function [106.69643619725652]
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results.
We report state-of-the-art results for text classification on several benchmark datasets.
arXiv Detail & Related papers (2020-09-08T21:55:22Z)
- Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets, Adult and Communities and Crime, as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, the F-measure objective yields improved micro-F1 scores, with absolute improvements of up to 8% compared to models trained with the cross-entropy loss function (a rough soft-F1 sketch appears after this list).
arXiv Detail & Related papers (2020-08-08T03:02:27Z)
- One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first-stage Matching-FCOS network and a second-stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
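
As a loose illustration of the differentiable F-measure idea mentioned in the Deep F-measure Maximization entry above, the sketch below uses a common "soft-F1" surrogate in which predicted probabilities stand in for hard true/false-positive counts, so the objective can be optimized with standard backpropagation. This is a generic stand-in under stated assumptions, not that paper's exact approximation.

```python
import torch

def soft_f1_loss(probs, targets, eps=1e-8):
    """Differentiable surrogate for micro-F1: probabilities replace hard 0/1
    predictions in the true/false-positive counts. probs and targets are
    (B, C) tensors of class probabilities and one-hot / multi-hot labels."""
    tp = (probs * targets).sum()            # soft true positives
    fp = (probs * (1.0 - targets)).sum()    # soft false positives
    fn = ((1.0 - probs) * targets).sum()    # soft false negatives
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1                         # minimize (1 - soft F1)

# Example: logits from any classifier head
logits = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([[1., 0., 0.],
                        [0., 1., 0.],
                        [0., 0., 1.],
                        [1., 0., 0.]])
loss = soft_f1_loss(torch.softmax(logits, dim=1), targets)
loss.backward()
```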