What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization?
- URL: http://arxiv.org/abs/2204.05832v1
- Date: Tue, 12 Apr 2022 14:19:49 GMT
- Title: What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization?
- Authors: Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won
Chung, Iz Beltagy, Julien Launay, Colin Raffel
- Abstract summary: We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
- Score: 50.84738303888189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pretrained Transformer language models have been shown to exhibit
zero-shot generalization, i.e. they can perform a wide variety of tasks that
they were not explicitly trained on. However, the architectures and pretraining
objectives used across state-of-the-art models differ significantly, and there
has been limited systematic comparison of these factors. In this work, we
present a large-scale evaluation of modeling choices and their impact on
zero-shot generalization. In particular, we focus on text-to-text models and
experiment with three model architectures (causal/non-causal decoder-only and
encoder-decoder), trained with two different pretraining objectives
(autoregressive and masked language modeling), and evaluated with and without
multitask prompted finetuning. We train models with over 5 billion parameters
for more than 170 billion tokens, thereby increasing the likelihood that our
conclusions will transfer to even larger scales. Our experiments show that
causal decoder-only models trained on an autoregressive language modeling
objective exhibit the strongest zero-shot generalization after purely
unsupervised pretraining. However, models with non-causal visibility on their
input trained with a masked language modeling objective followed by multitask
finetuning perform the best among our experiments. We therefore consider the
adaptation of pretrained models across architectures and objectives. We find
that pretrained non-causal decoder models can be adapted into performant
generative causal decoder models, using autoregressive language modeling as a
downstream task. Furthermore, we find that pretrained causal decoder models can
be efficiently adapted into non-causal decoder models, ultimately achieving
competitive performance after multitask finetuning. Code and checkpoints are
available at https://github.com/bigscience-workshop/architecture-objective.
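The architectural distinction at the heart of the study is the attention mask: a causal decoder restricts every position to past tokens, while a non-causal (prefix) decoder grants bidirectional visibility over the input segment and stays causal over the target. The sketch below illustrates the two masks; it is a minimal illustration rather than the released code, and the seq_len/prefix_len split is a hypothetical example.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Standard autoregressive visibility: position i attends to positions j <= i.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Non-causal ("prefix LM") visibility: bidirectional attention inside the
    # first `prefix_len` positions (the input), causal attention afterwards.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# True = attention allowed. With a 3-token prefix in a 5-token sequence, the
# first three rows see the whole prefix; the last two rows remain causal.
print(causal_mask(5).int())
print(prefix_lm_mask(5, prefix_len=3).int())
```

Viewed this way, the adaptation experiments in the abstract amount to continuing training under the other visibility pattern and objective, which is why a checkpoint pretrained with one mask can be converted into the other relatively cheaply.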
Related papers
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver a strong human preference rate on major evaluation platforms like AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z)
- Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved instruction following and safer generation.
However, the common practice of sampling during generation increases the chance of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z)
- StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention [2.66269503676104]
We introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures.
This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning.
Our experimental results show the superiority of StochCA over state-of-the-art approaches in both areas.
arXiv Detail & Related papers (2024-02-25T13:53:49Z)
- What is the best recipe for character-level encoder-only modelling? [2.792030485253753]
This paper aims to benchmark recent progress in language understanding models that output contextualised representations at the character level.
We find that our best performing character-level model exceeds the performance of a token-based model trained with the same settings on the same data.
We believe our results demonstrate the readiness of character-level models for multilingual language representation, and encourage NLP practitioners to try them as drop-in replacements for token-based models.
arXiv Detail & Related papers (2023-05-09T14:00:15Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
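The semi-causal objective mentioned above combines the two visibility patterns compared in the main paper: selected input spans are encoded bidirectionally while the rest of the sequence is modeled left-to-right. The following is only a mask-level schematic of that idea under assumed span boundaries; the actual system routes the bidirectional spans through separate pretrained encoders rather than a single shared mask.

```python
import torch

def semi_causal_mask(seq_len: int, bidir_spans: list[tuple[int, int]]) -> torch.Tensor:
    # Causal everywhere, except that positions inside each (start, end) span
    # attend bidirectionally to the rest of that span.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    for start, end in bidir_spans:
        mask[start:end, start:end] = True
    return mask

# Hypothetical example: tokens 2..5 of a 10-token sequence come from a
# bidirectionally encoded span (e.g. an image or a retrieved passage);
# everything else keeps the usual left-to-right visibility.
print(semi_causal_mask(10, [(2, 6)]).int())
```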