Unifying Language Learning Paradigms
- URL: http://arxiv.org/abs/2205.05131v1
- Date: Tue, 10 May 2022 19:32:20 GMT
- Title: Unifying Language Learning Paradigms
- Authors: Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal
Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
- Abstract summary: We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
- Score: 96.35981503087567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing pre-trained models are generally geared towards a particular class
of problems. To date, there still seems to be no consensus on what the right
architecture and pre-training setup should be. This paper presents a unified
framework for pre-training models that are universally effective across
datasets and setups. We begin by disentangling architectural archetypes from
pre-training objectives -- two concepts that are commonly conflated. Next, we
present a generalized and unified perspective for self-supervision in NLP and
show how different pre-training objectives can be cast as one another and how
interpolating between different objectives can be effective. We then propose
Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse
pre-training paradigms together. We furthermore introduce a notion of mode
switching, wherein downstream fine-tuning is associated with specific
pre-training schemes. We conduct extensive ablative experiments to compare
multiple pre-training objectives and find that our method pushes the
Pareto-frontier by outperforming T5 and/or GPT-like models across multiple
diverse setups. Finally, by scaling our model up to 20B parameters, we achieve
SOTA performance on 50 well-established supervised NLP tasks ranging from
language generation (with automated and human evaluation), language
understanding, text classification, question answering, commonsense reasoning,
long text reasoning, structured knowledge grounding and information retrieval.
Our model also achieves strong results at in-context learning, outperforming
175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on
one-shot summarization. We release Flax-based T5X model checkpoints for the 20B
model at
\url{https://github.com/google-research/google-research/tree/master/ul2}.
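To make the Mixture-of-Denoisers and mode-switching ideas above concrete, here is a minimal Python sketch of how a mixture objective can be sampled per example: each training example is corrupted by one of several denoiser configurations (short-span regular denoising, aggressive "extreme" denoising, or prefix-LM-style sequential denoising) and tagged with a paradigm token that the model can later be conditioned on. The token names, span lengths, and corruption rates below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import random

# Illustrative denoiser settings -- assumed values, not UL2's exact hyperparameters.
# Each denoiser corrupts the input differently and is announced by a paradigm token.
DENOISERS = {
    "[R]": {"mean_span": 3,  "corrupt_rate": 0.15},  # regular span corruption
    "[X]": {"mean_span": 12, "corrupt_rate": 0.50},  # extreme (long-span / high-rate) denoising
    "[S]": None,                                      # sequential, prefix-LM-style denoising
}


def corrupt(tokens, mode):
    """Build an (input, target) pair for one denoiser; sentinels mark removed spans."""
    if mode == "[S]":
        # Sequential denoising: keep a prefix, predict the rest.
        split = random.randint(1, len(tokens) - 1)
        return [mode] + tokens[:split], tokens[split:]
    cfg = DENOISERS[mode]
    budget = max(1, int(len(tokens) * cfg["corrupt_rate"]))  # tokens left to corrupt
    inputs, targets, i, sentinel = [mode], [], 0, 0
    while i < len(tokens):
        if budget > 0 and random.random() < cfg["corrupt_rate"]:
            span = min(cfg["mean_span"], budget, len(tokens) - i)
            inputs.append(f"<extra_id_{sentinel}>")
            targets += [f"<extra_id_{sentinel}>"] + tokens[i:i + span]
            i, budget, sentinel = i + span, budget - span, sentinel + 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


def mixture_of_denoisers_example(tokens):
    """Sample one denoiser per example, approximating a mixture objective."""
    mode = random.choice(list(DENOISERS))
    return corrupt(tokens, mode)


if __name__ == "__main__":
    toks = "the quick brown fox jumps over the lazy dog again".split()
    inp, tgt = mixture_of_denoisers_example(toks)
    print("input :", " ".join(inp))
    print("target:", " ".join(tgt))
```

In this sketch the paradigm token plays the role of the mode switch described in the abstract: at fine-tuning time, the token corresponding to the most suitable pre-training scheme would be prepended to the downstream input.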
Related papers
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
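The eP-ALM recipe summarized in the entry above lends itself to a short sketch: freeze essentially all parameters of a pretrained language model and perceptual encoder, train a single linear projection from perceptual features into the LM embedding space, and prepend one trainable token. The class below is a minimal PyTorch sketch under assumed interfaces (an LM that accepts `inputs_embeds` and a vision encoder returning patch features); names and dimensions are hypothetical placeholders, not the authors' code.

```python
import torch
import torch.nn as nn


class FrozenBackboneAdapter(nn.Module):
    """Hypothetical sketch of an eP-ALM-style setup: both backbones frozen,
    only one linear projection and one prepended soft token are trainable."""

    def __init__(self, language_model: nn.Module, vision_encoder: nn.Module,
                 vision_dim: int = 768, lm_dim: int = 1024):
        super().__init__()
        self.lm = language_model
        self.vision = vision_encoder
        for p in self.lm.parameters():        # freeze >99% of all parameters:
            p.requires_grad = False           # the whole language model ...
        for p in self.vision.parameters():
            p.requires_grad = False           # ... and the whole vision encoder
        self.proj = nn.Linear(vision_dim, lm_dim)                   # trainable piece #1
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))   # trainable piece #2

    def forward(self, image, text_embeds):
        vis = self.proj(self.vision(image))                 # (B, N_vis, lm_dim), assumed shape
        tok = self.soft_token.expand(vis.size(0), -1, -1)   # one trainable token per example
        inputs = torch.cat([tok, vis, text_embeds], dim=1)  # prepend token + perception to text
        return self.lm(inputs_embeds=inputs)                # assumed LM interface
```

Training would then optimize only `proj` and `soft_token`, which is what makes the adaptation parameter-efficient in the sense the summary describes.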
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
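The generator/scorer composition summarized in the entry above can be illustrated with a generic generate-score-select loop: a pre-trained generator proposes candidates, an ensemble of scorers rates them, and the consensus winner seeds the next round. The callables below are hypothetical stand-ins for arbitrary pre-trained models; this is a schematic sketch of closed-loop composition, not the paper's actual optimization procedure.

```python
from typing import Callable, List, Sequence

Generator = Callable[[str, int], List[str]]  # (prompt, n) -> candidate outputs
Scorer = Callable[[str], float]              # candidate -> scalar score


def iterative_consensus(prompt: str,
                        generate: Generator,
                        scorers: Sequence[Scorer],
                        rounds: int = 3,
                        n_candidates: int = 8) -> str:
    """Schematic closed loop: generate candidates, score them with an ensemble
    of scorers, and feed the highest-consensus candidate back as the next prompt."""
    best = prompt
    for _ in range(rounds):
        candidates = generate(best, n_candidates)
        # Consensus feedback: average the scores from every scorer in the ensemble,
        # rather than trusting the feedback of any single scorer.
        best = max(candidates, key=lambda c: sum(s(c) for s in scorers) / len(scorers))
    return best
```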
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- On the Role of Bidirectionality in Language Model Pre-Training [85.14614350372004]
We study the role of bidirectionality in next token prediction, text infilling, zero-shot priming and fine-tuning.
We train models with up to 6.7B parameters, and find differences to remain consistent at scale.
arXiv Detail & Related papers (2022-05-24T02:25:05Z)
- Life after BERT: What do Other Muppets Understand about Language? [7.896970044689526]
We use the oLMpics benchmark and psycholinguistic probing datasets for a diverse set of 29 models.
We adapt the oLMpics zero-shot setup for autoregressive models and evaluate GPT networks of different sizes.
arXiv Detail & Related papers (2022-05-21T23:57:17Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
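As a rough illustration of the dual sparsity that the DSEE summary above describes, the toy functions below (i) update only a sparse mask of weight deltas on top of frozen pre-trained weights, for parameter-efficient fine-tuning, and (ii) magnitude-prune the merged weights afterwards, for resource-efficient inference. Mask construction and other details of the actual method are omitted; these are assumed simplifications, not DSEE itself.

```python
import torch


def sparse_delta_step(delta: torch.Tensor, grad: torch.Tensor,
                      mask: torch.Tensor, lr: float = 1e-3) -> torch.Tensor:
    """Parameter-efficient tuning: only the masked (sparse) entries of the weight
    delta are updated; the pre-trained weights themselves stay frozen."""
    return delta - lr * grad * mask


def prune_for_inference(w0: torch.Tensor, delta: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Resource-efficient inference: merge the sparse delta into the pre-trained
    weights, then keep only the largest-magnitude entries of the result."""
    w = w0 + delta
    k = max(1, int(w.numel() * keep_ratio))
    # Threshold at the k-th largest magnitude; everything below it is zeroed.
    threshold = w.abs().flatten().kthvalue(w.numel() - k + 1).values
    return torch.where(w.abs() >= threshold, w, torch.zeros_like(w))
```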
- Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is a performant source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.