What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization?
- URL: http://arxiv.org/abs/2204.05832v1
- Date: Tue, 12 Apr 2022 14:19:49 GMT
- Title: What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization?
- Authors: Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won
Chung, Iz Beltagy, Julien Launay, Colin Raffel
- Abstract summary: We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
- Score: 50.84738303888189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pretrained Transformer language models have been shown to exhibit
zero-shot generalization, i.e. they can perform a wide variety of tasks that
they were not explicitly trained on. However, the architectures and pretraining
objectives used across state-of-the-art models differ significantly, and there
has been limited systematic comparison of these factors. In this work, we
present a large-scale evaluation of modeling choices and their impact on
zero-shot generalization. In particular, we focus on text-to-text models and
experiment with three model architectures (causal/non-causal decoder-only and
encoder-decoder), trained with two different pretraining objectives
(autoregressive and masked language modeling), and evaluated with and without
multitask prompted finetuning. We train models with over 5 billion parameters
for more than 170 billion tokens, thereby increasing the likelihood that our
conclusions will transfer to even larger scales. Our experiments show that
causal decoder-only models trained on an autoregressive language modeling
objective exhibit the strongest zero-shot generalization after purely
unsupervised pretraining. However, models with non-causal visibility on their
input trained with a masked language modeling objective followed by multitask
finetuning perform the best among our experiments. We therefore consider the
adaptation of pretrained models across architectures and objectives. We find
that pretrained non-causal decoder models can be adapted into performant
generative causal decoder models, using autoregressive language modeling as a
downstream task. Furthermore, we find that pretrained causal decoder models can
be efficiently adapted into non-causal decoder models, ultimately achieving
competitive performance after multitask finetuning. Code and checkpoints are
available at https://github.com/bigscience-workshop/architecture-objective.
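The architectural distinction at the heart of the study is the attention mask: a causal decoder restricts every position to past tokens, while a non-causal (prefix) decoder grants bidirectional visibility over the input segment and stays causal over the target. The sketch below illustrates the two masks; it is a minimal illustration rather than the released code, and the seq_len/prefix_len split is a hypothetical example.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Standard autoregressive visibility: position i attends to positions j <= i.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Non-causal ("prefix LM") visibility: bidirectional attention inside the
    # first `prefix_len` positions (the input), causal attention afterwards.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# True = attention allowed. With a 3-token prefix in a 5-token sequence, the
# first three rows see the whole prefix; the last two rows remain causal.
print(causal_mask(5).int())
print(prefix_lm_mask(5, prefix_len=3).int())
```

Viewed this way, the adaptation experiments in the abstract amount to continuing training under the other visibility pattern and objective, which is why a checkpoint pretrained with one mask can be converted into the other relatively cheaply.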
Related papers
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver a strong human preference rate on major evaluation platforms like AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z)
- Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved instruction following and safer generation.
However, the common practice of sampling during generation increases the chance of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z)
- StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention [2.66269503676104]
We introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures.
This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning.
Our experimental results show the superiority of StochCA over state-of-the-art approaches in both areas.
arXiv Detail & Related papers (2024-02-25T13:53:49Z)
- What is the best recipe for character-level encoder-only modelling? [2.792030485253753]
This paper aims to benchmark recent progress in language understanding models that output contextualised representations at the character level.
We find that our best performing character-level model exceeds the performance of a token-based model trained with the same settings on the same data.
We believe our results demonstrate the readiness of character-level models for multilingual language representation, and encourage NLP practitioners to try them as drop-in replacements for token-based models.
arXiv Detail & Related papers (2023-05-09T14:00:15Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
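The semi-causal objective mentioned above combines the two visibility patterns compared in the main paper: selected input spans are encoded bidirectionally while the rest of the sequence is modeled left-to-right. The following is only a mask-level schematic of that idea under assumed span boundaries; the actual system routes the bidirectional spans through separate pretrained encoders rather than a single shared mask.

```python
import torch

def semi_causal_mask(seq_len: int, bidir_spans: list[tuple[int, int]]) -> torch.Tensor:
    # Causal everywhere, except that positions inside each (start, end) span
    # attend bidirectionally to the rest of that span.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    for start, end in bidir_spans:
        mask[start:end, start:end] = True
    return mask

# Hypothetical example: tokens 2..5 of a 10-token sequence come from a
# bidirectionally encoded span (e.g. an image or a retrieved passage);
# everything else keeps the usual left-to-right visibility.
print(semi_causal_mask(10, [(2, 6)]).int())
```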