Bag of Tricks for Effective Language Model Pretraining and Downstream
Adaptation: A Case Study on GLUE
- URL: http://arxiv.org/abs/2302.09268v1
- Date: Sat, 18 Feb 2023 09:26:35 GMT
- Title: Bag of Tricks for Effective Language Model Pretraining and Downstream
Adaptation: A Case Study on GLUE
- Authors: Qihuang Zhong, Liang Ding, Keqin Peng, Juhua Liu, Bo Du, Li Shen,
Yibing Zhan and Dacheng Tao
- Abstract summary: This report briefly describes our submission Vega v1 on the General Language Understanding Evaluation leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3 billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
- Score: 93.98660272309974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report briefly describes our JDExplore d-team's submission
Vega v1 on the General Language Understanding Evaluation (GLUE) leaderboard,
where GLUE is a collection of nine natural language understanding tasks,
including question answering, linguistic acceptability, sentiment analysis,
text similarity, paraphrase detection, and natural language inference. [Method]
We investigate several effective strategies and choose their best combination as the training recipe. As for the model structure, we employ the vanilla Transformer with disentangled attention as the basic encoder block. For
self-supervised training, we employ the representative denoising objective
(i.e., replaced token detection) in phase 1 and combine the contrastive
objective (i.e., sentence embedding contrastive learning) with it in phase 2.
During fine-tuning, several advanced techniques such as transductive
fine-tuning, self-calibrated fine-tuning, and adversarial fine-tuning are
adopted. [Results] According to our submission record (Jan. 2022), with our
optimized pretraining and fine-tuning strategies, our 1.3 billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
Encouragingly, our Vega v1 is the first submission to exceed human performance on two challenging tasks, i.e., SST-2 and WNLI. We believe our empirically
successful recipe with a bag of tricks could shed new light on developing
efficient discriminative large language models.
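To make the two-phase self-supervised recipe concrete, here is a minimal PyTorch-style sketch (our illustration, not the authors' released code) of a phase-2 loss that combines ELECTRA-style replaced token detection with a SimCSE-style in-batch contrastive term on sentence embeddings. The pooling choice, tensor shapes, and the weighting factor `alpha` are assumptions; phase 1 would use the RTD term alone.

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, is_replaced):
    """ELECTRA-style replaced token detection: per-token binary prediction of
    whether the token was substituted by the generator.
    disc_logits: (batch, seq_len); is_replaced: (batch, seq_len) in {0, 1}."""
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced.float())

def contrastive_loss(emb_a, emb_b, temperature=0.05):
    """SimCSE-style in-batch InfoNCE over two encodings of the same sentences.
    emb_a, emb_b: (batch, hidden) pooled sentence embeddings."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sim = emb_a @ emb_b.t() / temperature              # (batch, batch) cosine similarities
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)               # positives lie on the diagonal

def phase2_loss(disc_logits, is_replaced, emb_a, emb_b, alpha=0.1):
    """Phase 2: denoising (RTD) objective plus a weighted contrastive term.
    `alpha` is a hypothetical weight, not a value reported in the paper."""
    return rtd_loss(disc_logits, is_replaced) + alpha * contrastive_loss(emb_a, emb_b)
```

Similarly, "adversarial fine-tuning" is commonly realized by perturbing the token embeddings in the gradient direction (FGM-style). The report does not specify the exact variant, so the sketch below is only one plausible instantiation under that assumption; it reuses the imports above, and `model`, `loss_fn`, and `embedding_layer` are placeholders supplied by the caller.

```python
def fgm_adversarial_step(model, embedding_layer, loss_fn, inputs, labels, epsilon=1.0):
    """One fine-tuning step with an FGM-style perturbation of the token embeddings.
    `embedding_layer` is assumed to be the model's nn.Embedding for input tokens."""
    loss = loss_fn(model(inputs), labels)
    loss.backward()                                    # gradients of the clean loss

    grad = embedding_layer.weight.grad
    if grad is not None and grad.norm() > 0:
        delta = epsilon * grad / grad.norm()           # step up the loss in embedding space
        embedding_layer.weight.data.add_(delta)
        adv_loss = loss_fn(model(inputs), labels)      # loss on the perturbed embeddings
        adv_loss.backward()                            # accumulate adversarial gradients
        embedding_layer.weight.data.sub_(delta)        # restore the original embeddings
    return loss.detach()                               # caller then runs optimizer.step()
```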
Related papers
- Teaching a Language Model to Distinguish Between Similar Details using a Small Adversarial Training Set [0.0]
We show an increase in accuracy on the adversarial test set (+13%) while still maintaining good performance on the original NLI task.
We also show an increase in accuracy from 91.2% to 92.9% on the most similar contradictions in the SNLI test set (as judged by cosine similarity).
arXiv Detail & Related papers (2024-10-30T15:27:55Z)
- Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE [203.65227947509933]
This report describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard.
SuperGLUE is more challenging than the widely used General Language Understanding Evaluation (GLUE) benchmark, containing eight difficult language understanding tasks.
arXiv Detail & Related papers (2022-12-04T15:36:18Z)
- UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation [0.0]
This paper addresses SemEval-2022 Task 3, PreTENS: Presupposed Taxonomies Evaluating Neural Network Semantics.
The goal of the task is to identify if a sentence is deemed acceptable or not, depending on the taxonomic relationship that holds between a noun pair contained in the sentence.
We propose an effective way to enhance the robustness and the generalizability of language models for better classification.
arXiv Detail & Related papers (2022-10-07T07:41:28Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming the 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark [8.158067688043554]
This work first introduces the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive small-sample evaluation benchmark in Chinese.
An unlabeled training set with up to 20,000 additional samples per task is provided, allowing researchers to explore better ways of using unlabeled samples.
Next, we implement a set of state-of-the-art few-shot learning methods, and compare their performance with fine-tuning and zero-shot learning schemes on the newly constructed FewCLUE benchmark.
arXiv Detail & Related papers (2021-07-15T17:51:25Z)
- Making Pre-trained Language Models Better Few-shot Learners [11.90626040104822]
The recent GPT-3 model achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context.
Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient.
We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples.
arXiv Detail & Related papers (2020-12-31T17:21:26Z)
- MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
- Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space [109.79957125584252]
The Variational Autoencoder (VAE) can be both a powerful generative model and an effective representation learning framework for natural language.
In this paper, we propose the first large-scale language VAE model, Optimus.
arXiv Detail & Related papers (2020-04-05T06:20:18Z)