Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- URL: http://arxiv.org/abs/2112.11446v1
- Date: Wed, 8 Dec 2021 19:41:47 GMT
- Title: Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- Authors: Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan
Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah
Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard
Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen
Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang,
Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese,
Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden,
Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena
Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena
Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste
Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux,
Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson
d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan
Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew
Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed
Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem
Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu,
Geoffrey Irving
- Abstract summary: We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
- Score: 83.98181046650664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language modelling provides a step towards intelligent communication systems
by harnessing large repositories of written human knowledge to better predict
and understand the world. In this paper, we present an analysis of
Transformer-based language model performance across a wide range of model
scales -- from models with tens of millions of parameters up to a 280 billion
parameter model called Gopher. These models are evaluated on 152 diverse tasks,
achieving state-of-the-art performance across the majority. Gains from scale
are largest in areas such as reading comprehension, fact-checking, and the
identification of toxic language, but logical and mathematical reasoning see
less benefit. We provide a holistic analysis of the training dataset and
model's behaviour, covering the intersection of model scale with bias and
toxicity. Finally we discuss the application of language models to AI safety
and the mitigation of downstream harms.
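The abstract describes performance across model scales from tens of millions up to 280 billion parameters. As an illustration only, the sketch below shows the kind of saturating power-law fit commonly used in such scaling analyses; the parameter counts and loss values are hypothetical placeholders, not numbers reported in the paper.

```python
# Illustrative sketch (not from the paper): fit a saturating power law
# L(N) = a * N^(-b) + c to hypothetical (model size, loss) pairs, a common
# functional form for language model loss as a function of parameter count.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical placeholder data: model sizes in millions of parameters,
# spanning roughly the range discussed in the abstract, and made-up losses.
params_m = np.array([44, 117, 417, 1400, 7100, 280000], dtype=float)
losses = np.array([3.40, 3.05, 2.72, 2.46, 2.22, 1.95])

def power_law(n, a, b, c):
    # Loss decays as a power of model size, approaching an irreducible floor c.
    return a * n ** (-b) + c

(a, b, c), _ = curve_fit(power_law, params_m, losses, p0=[5.0, 0.2, 1.5], maxfev=10000)
print(f"fitted exponent b = {b:.3f}, irreducible loss c = {c:.2f}")

# Extrapolate to a larger hypothetical model (one trillion parameters).
print(f"predicted loss at 1e6 million params: {power_law(1e6, a, b, c):.2f}")
```

Fitting in units of millions of parameters only keeps the optimisation numerically well behaved; rescaling the parameter count changes the coefficient a but not the exponent b.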
Related papers
- Computational Models to Study Language Processing in the Human Brain: A Survey [47.81066391664416]
This paper reviews efforts in using computational models for brain research, highlighting emerging trends.
Our analysis reveals that no single model outperforms others on all datasets.
arXiv Detail & Related papers (2024-03-20T08:01:22Z)
- Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists of splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling.
Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training.
Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z)
- Reimagining Retrieval Augmented Language Models for Answering Queries [23.373952699385427]
We present a reality check on large language models and inspect the promise of retrieval augmented language models in comparison.
Such language models are semi-parametric: they combine model parameters with knowledge retrieved from external data sources to make their predictions.
arXiv Detail & Related papers (2023-06-01T18:08:51Z)
- A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To discriminate the difference in parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Analyzing Bagging Methods for Language Models [0.5161531917413708]
We perform an analysis of bagging language models and compare single language models to bagged ensembles that are roughly equivalent in terms of final model size.
Our ensembling methods are at best roughly equivalent to single LM baselines; a hedged sketch of probability-averaging ensembling appears after this list.
arXiv Detail & Related papers (2022-07-19T06:30:37Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM.
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
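As referenced in the bagging entry above, here is a minimal sketch of one common way to ensemble language models at inference time: averaging next-token probability distributions across independently trained models. It assumes a HuggingFace-style causal LM interface where model(input_ids).logits gives per-position logits, and a shared tokenizer; models and tokenizer are hypothetical placeholders, and this is not the cited paper's exact setup.

```python
# Illustrative sketch: uniform probability averaging over an ensemble of
# causal language models that share a tokenizer/vocabulary.
import torch
import torch.nn.functional as F

def ensemble_next_token_probs(models, input_ids):
    """Average the next-token distributions predicted by several causal LMs."""
    probs = []
    with torch.no_grad():
        for model in models:
            logits = model(input_ids).logits[:, -1, :]  # logits at the last position
            probs.append(F.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0)               # uniform average over models

# Hypothetical usage: greedily pick the next token from the averaged distribution.
# input_ids = tokenizer("Language modelling provides", return_tensors="pt").input_ids
# next_token = ensemble_next_token_probs(models, input_ids).argmax(dim=-1)
```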