Scaling Data-Constrained Language Models
- URL: http://arxiv.org/abs/2305.16264v4
- Date: Thu, 26 Oct 2023 02:24:24 GMT
- Title: Scaling Data-Constrained Language Models
- Authors: Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao,
Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel
- Abstract summary: We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
- Score: 137.17302576977346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The current trend of scaling language models involves increasing both
parameter count and training dataset size. Extrapolating this trend suggests
that training dataset size may soon be limited by the amount of text data
available on the internet. Motivated by this limit, we investigate scaling
language models in data-constrained regimes. Specifically, we run a large set
of experiments varying the extent of data repetition and compute budget,
ranging up to 900 billion training tokens and 9 billion parameter models. We
find that with constrained data for a fixed compute budget, training with up to
4 epochs of repeated data yields negligible changes to loss compared to having
unique data. However, with more repetition, the value of adding compute
eventually decays to zero. We propose and empirically validate a scaling law
for compute optimality that accounts for the decreasing value of repeated
tokens and excess parameters. Finally, we experiment with approaches mitigating
data scarcity, including augmenting the training dataset with code data or
removing commonly used filters. Models and datasets from our 400 training runs
are freely available at https://github.com/huggingface/datablations.
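The abstract states that a data-constrained scaling law exists but does not give its functional form. As a minimal sketch, assuming a Chinchilla-style loss L(N, D) = E + A/N^alpha + B/D^beta in which repeated tokens enter through an exponentially decaying "effective data" term (the paper applies an analogous correction to excess parameters, omitted here), one might write something like the following; every constant below is an illustrative placeholder, not a fitted value from the paper:

```python
import math

def effective_tokens(unique_tokens: float, total_tokens: float, r_star: float = 15.0) -> float:
    """Illustrative 'effective data' term: the first epoch counts in full,
    repeated epochs contribute with exponentially decaying value.
    r_star (a decay constant in units of epochs) is a placeholder, not a fitted value."""
    repetitions = max(total_tokens / unique_tokens - 1.0, 0.0)  # epochs beyond the first
    return unique_tokens + unique_tokens * r_star * (1.0 - math.exp(-repetitions / r_star))

def loss(params: float, unique_tokens: float, total_tokens: float) -> float:
    """Chinchilla-style loss with the repeated-data correction above.
    E, A, B, alpha, beta are illustrative placeholders, not the paper's fit."""
    E, A, B, alpha, beta = 1.7, 4.0e2, 4.0e2, 0.34, 0.28
    d_eff = effective_tokens(unique_tokens, total_tokens)
    return E + A / params ** alpha + B / d_eff ** beta

# With these placeholder constants, 4 epochs over a 100B-token corpus recover most of
# the benefit of 4x as much unique data, while further repetition helps less and less.
for epochs in (1, 4, 16, 64):
    print(f"{epochs:>2} epochs: predicted loss {loss(9e9, 100e9, 100e9 * epochs):.3f}")
```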
Related papers
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods are slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
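The summary names Synaptic Intelligence and gradient norms as valuation signals but not the exact procedure. The sketch below is a hypothetical stand-in, not the authors' implementation: it scores each example by the norm of its per-example loss gradient under a toy linear model and then keeps one high-value representative per k-means cluster; all names, model choices, and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy dataset and a linear scorer standing in for a trained network (hypothetical).
X = rng.normal(size=(1000, 32))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(float)
w = 0.01 * rng.normal(size=32)

def per_example_grad_norm(X, y, w):
    """Gradient-norm valuation for a logistic model: examples with a large
    loss-gradient norm are treated as more informative."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
    grads = (p - y)[:, None] * X       # d(log-loss)/dw for each example
    return np.linalg.norm(grads, axis=1)

def select_by_clustering(X, values, n_keep=100):
    """Offline selection: cluster the inputs, then keep the highest-valued
    example from each cluster."""
    labels = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit_predict(X)
    keep = [np.where(labels == c)[0][np.argmax(values[labels == c])]
            for c in range(n_keep)]
    return np.array(keep)

values = per_example_grad_norm(X, y, w)
subset = select_by_clustering(X, values)
print(subset.shape)  # indices of the 100 retained training examples
```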
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- Post-training Model Quantization Using GANs for Synthetic Data Generation [57.40733249681334]
We investigate the use of synthetic data as a substitute for real calibration data in post-training quantization.
We compare the performance of models quantized using data generated by StyleGAN2-ADA and our pre-trained DiStyleGAN, with quantization using real data and an alternative data generation method based on fractal images.
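The entry does not spell out the calibration step, so here is a library-agnostic sketch of the idea: post-training quantization needs activation ranges estimated on some calibration set, and synthetically generated images can stand in for real ones at that point. The generator, layer, and ranges below are placeholders (a real pipeline would use StyleGAN2-ADA samples and a framework's quantization tooling).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 3)).astype(np.float32)  # weights of a stand-in layer

def sample_synthetic_batch(n=32):
    """Placeholder for images drawn from a generator such as StyleGAN2-ADA;
    here just random tensors with an image-like shape and value range."""
    return rng.uniform(0.0, 1.0, size=(n, 3, 64, 64)).astype(np.float32)

def layer(x):
    """Stand-in for one network layer whose activations need calibration."""
    return np.einsum("oc,nchw->nohw", W, x)

def calibrate_range(batches):
    """Post-training calibration: record min/max activations over the
    calibration set (synthetic batches here, real images in the baseline)."""
    lo, hi = np.inf, -np.inf
    for x in batches:
        a = layer(x)
        lo, hi = min(lo, float(a.min())), max(hi, float(a.max()))
    return lo, hi

def quantize_uint8(a, lo, hi):
    """Affine 8-bit quantization using the calibrated range."""
    scale = (hi - lo) / 255.0
    return np.clip(np.round((a - lo) / scale), 0, 255).astype(np.uint8), scale, lo

lo, hi = calibrate_range(sample_synthetic_batch() for _ in range(8))
q, scale, zero_point = quantize_uint8(layer(sample_synthetic_batch()), lo, hi)
print(f"calibrated range [{lo:.2f}, {hi:.2f}], quantized dtype {q.dtype}")
```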
arXiv Detail & Related papers (2023-05-10T11:10:09Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Data Aggregation for Reducing Training Data in Symbolic Regression [0.0]
This work discusses methods to reduce the training data and thereby also the runtime of genetic programming.
K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method.
The performance of genetic programming is compared with random forests and linear regression.
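As a small illustration of the aggregation idea (with linear regression standing in for the paper's genetic-programming setup), the sketch below replaces a regression training set with k-means cluster centroids and compares that against plain random sampling; the dataset and sizes are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic regression problem standing in for a symbolic-regression dataset.
X = rng.uniform(-3, 3, size=(5000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=5000)

def aggregate_kmeans(X, y, k=100):
    """Data aggregation: cluster the inputs and train on one aggregated point
    per cluster (centroid input, mean target)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    Xa = np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
    ya = np.array([y[labels == c].mean() for c in range(k)])
    return Xa, ya

def sample_random(X, y, k=100):
    """Baseline: keep k randomly chosen training points."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx], y[idx]

for name, (Xs, ys) in (("k-means", aggregate_kmeans(X, y)), ("random", sample_random(X, y))):
    model = LinearRegression().fit(Xs, ys)      # stand-in learner, not genetic programming
    print(f"{name}: R^2 on the full data = {model.score(X, y):.3f}")
```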
arXiv Detail & Related papers (2021-08-24T11:58:17Z)
- Scaling Laws for Transfer [0.5432984841650929]
We study scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting.
We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.
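The entry gives the form of the fit only in words; a minimal numeric sketch of such a power law (effective data transferred as a function of parameter count and fine-tuning dataset size) follows, with the constant and exponents as purely illustrative placeholders rather than the paper's fitted values.

```python
def effective_data_transferred(params: float, finetune_tokens: float,
                               k: float = 100.0, alpha: float = 0.2, beta: float = 0.4) -> float:
    """Power law of the form D_T = k * D_F**alpha * N**beta: the pretrained model
    behaves as if it had seen D_T extra fine-tuning tokens. k, alpha and beta are
    illustrative placeholders, not the paper's fitted values."""
    return k * finetune_tokens ** alpha * params ** beta

# In the low-data regime, a bigger model converts the same small fine-tuning set
# into more effective data.
for n_params in (1e8, 1e9, 1e10):
    d_t = effective_data_transferred(n_params, finetune_tokens=1e6)
    print(f"{n_params:.0e} params: effective data transferred ~ {d_t:.2e} tokens")
```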
arXiv Detail & Related papers (2021-02-02T04:07:38Z)
- Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.
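To make the power-law statement concrete, a toy evaluation of the usual single-variable forms L(N) = (N_c / N)^alpha_N and L(D) = (D_c / D)^alpha_D is given below; the constants are rounded placeholders in the spirit of the reported fits, not exact values.

```python
def loss_vs_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """L(N) = (N_c / N) ** alpha_N: loss when neither data nor compute is the bottleneck.
    Constants are rounded placeholders in the spirit of the reported fits."""
    return (n_c / n_params) ** alpha_n

def loss_vs_data(n_tokens: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """L(D) = (D_c / D) ** alpha_D: loss when model size is not the bottleneck."""
    return (d_c / n_tokens) ** alpha_d

# A power law means each 10x increase in parameters or tokens multiplies the loss
# by a fixed factor.
for scale in (1e8, 1e9, 1e10):
    print(f"N={scale:.0e}: L~{loss_vs_params(scale):.2f}   D={scale:.0e}: L~{loss_vs_data(scale):.2f}")
```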
arXiv Detail & Related papers (2020-01-23T03:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.