Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$
- URL: http://arxiv.org/abs/2203.17189v1
- Date: Thu, 31 Mar 2022 17:12:13 GMT
- Title: Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$
- Authors: Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James
Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz
Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee,
Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha
Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier
Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan
Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin
Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel,
Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua
Newlan, Andrea Gesmundo
- Abstract summary: $\texttt{t5x}$ and $\texttt{seqio}$ are open source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
- Score: 118.04625413322827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent neural network-based language models have benefited greatly from
scaling up the size of training datasets and the number of parameters in the
models themselves. Scaling can be complicated due to various factors including
the need to distribute computation on supercomputer clusters (e.g., TPUs),
prevent bottlenecks when infeeding data, and ensure reproducible results. In
this work, we present two software libraries that ease these issues:
$\texttt{t5x}$ simplifies the process of building and training large language
models at scale while maintaining ease of use, and $\texttt{seqio}$ provides a
task-based API for simple creation of fast and reproducible training data and
evaluation pipelines. These open-source libraries have been used to train
models with hundreds of billions of parameters on datasets with multiple
terabytes of training data.
Along with the libraries, we release configurations and instructions for
T5-like encoder-decoder models as well as GPT-like decoder-only architectures.
$\texttt{t5x}$ and $\texttt{seqio}$ are open source and available at
https://github.com/google-research/t5x and https://github.com/google/seqio,
respectively.
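To make the task-based API concrete, below is a minimal sketch of registering and loading a $\texttt{seqio}$ Task, following the pattern documented in the seqio repository; the task name, TFDS dataset version, and vocabulary path are placeholder assumptions, not artifacts released with this paper.

```python
import functools
import seqio

# Placeholder vocabulary path; any SentencePiece model file works here.
vocab = seqio.SentencePieceVocabulary("/path/to/spm.model")

# Register a Task: a data source plus preprocessors and output features.
seqio.TaskRegistry.add(
    "example_lm_task",  # hypothetical task name
    source=seqio.TfdsDataSource(tfds_name="c4/en:3.0.1"),
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"}),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
    metric_fns=[],
)

# Build a tokenized tf.data pipeline from the registered Task.
ds = seqio.get_mixture_or_task("example_lm_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    shuffle=True,
    seed=42,
)
```

In a typical $\texttt{t5x}$ setup, a Task or Mixture registered this way is referenced from a gin configuration passed to the t5x train binary, so the same pipeline definition is reused for training and evaluation.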
Related papers
- $\texttt{dattri}$: A Library for Efficient Data Attribution [7.803566162554017]
Data attribution methods aim to quantify the influence of individual training samples on the prediction of artificial intelligence (AI) models.
Despite a surge of newly developed data attribution methods, a comprehensive library that facilitates their development, benchmarking, and deployment has been lacking.
In this work, we introduce $\texttt{dattri}$, an open-source data attribution library that addresses the above needs.
arXiv Detail & Related papers (2024-10-06T17:18:09Z) - Generating QM1B with PySCF$_{\text{IPU}}$ [40.29005019051567]
This paper introduces the data generator PySCF$_{\text{IPU}}$, which uses Intelligence Processing Units (IPUs).
It allows us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms.
We highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets.
arXiv Detail & Related papers (2023-11-02T10:31:20Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - Torch-Choice: A PyTorch Package for Large-Scale Choice Modelling with
Python [11.566791864440262]
$\texttt{torch-choice}$ is an open-source library for flexible, fast choice modeling with Python and PyTorch.
It provides a $\texttt{ChoiceDataset}$ data structure to manage databases flexibly and memory-efficiently.
arXiv Detail & Related papers (2023-04-04T16:00:48Z) - Chunk-based Nearest Neighbor Machine Translation [7.747003493657217]
We introduce a \textit{chunk-based} $k$NN-MT model which retrieves chunks of tokens from the datastore, instead of a single token.
Experiments on machine translation in two settings, static domain adaptation and "on-the-fly" adaptation, show that the chunk-based model leads to a significant speed-up (up to 4 times) with only a small drop in translation quality.
arXiv Detail & Related papers (2022-05-24T17:39:25Z) - Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z) - Deduplicating Training Data Makes Language Models Better [50.22588162039083]
Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings.
Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
We develop two tools that allow us to deduplicate training datasets.
arXiv Detail & Related papers (2021-07-14T06:06:52Z) - Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data
via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at only a fraction of their entries.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z) - HetSeq: Distributed GPU Training on Heterogeneous Infrastructure [13.689451154861203]
HetSeq is a software package that provides the capability to train large neural network models on heterogeneous infrastructure.
Experiments with transformer translation and the BERT language model show that HetSeq scales over heterogeneous systems.
arXiv Detail & Related papers (2020-09-25T19:57:42Z)