Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$
- URL: http://arxiv.org/abs/2203.17189v1
- Date: Thu, 31 Mar 2022 17:12:13 GMT
- Title: Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$
- Authors: Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James
Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz
Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee,
Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha
Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier
Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan
Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin
Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel,
Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua
Newlan, Andrea Gesmundo
- Abstract summary: $\texttt{t5x}$ and $\texttt{seqio}$ are open source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
- Score: 118.04625413322827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent neural network-based language models have benefited greatly from
scaling up the size of training datasets and the number of parameters in the
models themselves. Scaling can be complicated due to various factors including
the need to distribute computation on supercomputer clusters (e.g., TPUs),
prevent bottlenecks when infeeding data, and ensure reproducible results. In
this work, we present two software libraries that ease these issues:
$\texttt{t5x}$ simplifies the process of building and training large language
models at scale while maintaining ease of use, and $\texttt{seqio}$ provides a
task-based API for simple creation of fast and reproducible training data and
evaluation pipelines. These open-source libraries have been used to train
models with hundreds of billions of parameters on datasets with multiple
terabytes of training data.
Along with the libraries, we release configurations and instructions for
T5-like encoder-decoder models as well as GPT-like decoder-only architectures.
$\texttt{t5x}$ and $\texttt{seqio}$ are open source and available at
https://github.com/google-research/t5x and https://github.com/google/seqio,
respectively.
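To make the task-based API concrete, below is a minimal sketch of registering and loading a $\texttt{seqio}$ Task, following the pattern documented in the seqio repository; the task name, TFDS dataset version, and vocabulary path are placeholder assumptions, not artifacts released with this paper.

```python
import functools
import seqio

# Placeholder vocabulary path; any SentencePiece model file works here.
vocab = seqio.SentencePieceVocabulary("/path/to/spm.model")

# Register a Task: a data source plus preprocessors and output features.
seqio.TaskRegistry.add(
    "example_lm_task",  # hypothetical task name
    source=seqio.TfdsDataSource(tfds_name="c4/en:3.0.1"),
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"}),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
    metric_fns=[],
)

# Build a tokenized tf.data pipeline from the registered Task.
ds = seqio.get_mixture_or_task("example_lm_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    shuffle=True,
    seed=42,
)
```

In a typical $\texttt{t5x}$ setup, a Task or Mixture registered this way is referenced from a gin configuration passed to the t5x train binary, so the same pipeline definition is reused for training and evaluation.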
Related papers
- $\texttt{dattri}$: A Library for Efficient Data Attribution [7.803566162554017]
Data attribution methods aim to quantify the influence of individual training samples on the prediction of artificial intelligence (AI) models.
Despite a surge of newly developed data attribution methods, a comprehensive library that facilitates their development, benchmarking, and deployment has been lacking.
In this work, we introduce $\texttt{dattri}$, an open-source data attribution library that addresses the above needs.
arXiv Detail & Related papers (2024-10-06T17:18:09Z) - Generating QM1B with PySCF$_{\text{IPU}}$ [40.29005019051567]
This paper introduces the data generator PySCF$_{\text{IPU}}$, which uses Intelligence Processing Units (IPUs).
It allows us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms.
We highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets.
arXiv Detail & Related papers (2023-11-02T10:31:20Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - Torch-Choice: A PyTorch Package for Large-Scale Choice Modelling with
Python [11.566791864440262]
$\texttt{torch-choice}$ is an open-source library for flexible, fast choice modeling with Python and PyTorch.
It provides a $\texttt{ChoiceDataset}$ data structure to manage databases flexibly and memory-efficiently.
arXiv Detail & Related papers (2023-04-04T16:00:48Z) - Chunk-based Nearest Neighbor Machine Translation [7.747003493657217]
We introduce a \textit{chunk-based} $k$NN-MT model which retrieves chunks of tokens from the datastore, instead of a single token.
Experiments on machine translation in two settings, static domain adaptation and "on-the-fly" adaptation, show that the chunk-based model leads to a significant speed-up (up to 4 times) with only a small drop in translation quality.
arXiv Detail & Related papers (2022-05-24T17:39:25Z) - Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z) - Deduplicating Training Data Makes Language Models Better [50.22588162039083]
Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings.
Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
We develop two tools that allow us to deduplicate training datasets.
arXiv Detail & Related papers (2021-07-14T06:06:52Z) - Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data
via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at only a fraction of their entries.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z) - HetSeq: Distributed GPU Training on Heterogeneous Infrastructure [13.689451154861203]
HetSeq is a software package that provides the capability to train large neural network models on heterogeneous infrastructure.
Experiments with transformer translation and the BERT language model show that HetSeq scales over heterogeneous systems.
arXiv Detail & Related papers (2020-09-25T19:57:42Z)