BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
- URL: http://arxiv.org/abs/2410.15365v2
- Date: Sun, 03 Nov 2024 09:23:12 GMT
- Title: BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
- Authors: Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos, Maria Lymperaiou, Giorgos Stamou
- Abstract summary: We study the effect of synthetic story data in language pre-training using TinyStories.
We train GPT-Neo models on subsets of TinyStories, while varying the amount of available data.
We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story.
- Abstract: We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using TinyStories: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of TinyStories, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset of: a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall have a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low-resource settings and underscores their potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.
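As a rough illustration of the generation stage described in the abstract, the sketch below samples a story completion from a GPT-Neo checkpoint with Hugging Face transformers. The checkpoint name, prompt, and sampling settings are placeholders for illustration, not the authors' TinyStories-trained models or their exact configuration.

```python
# Illustrative only: sampling a story completion with a GPT-Neo model via Hugging Face.
# The public 125M checkpoint stands in for the paper's own TinyStories-trained models.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # placeholder checkpoint, not the authors' models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon a time, there was a little girl named Lily who found a shiny stone."
inputs = tokenizer(prompt, return_tensors="pt")

# Nucleus sampling for an original, story-like continuation of the prefix.
output_ids = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```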
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Is Child-Directed Speech Effective Training Data for Language Models? [34.46268640655943]
We train GPT-2 and RoBERTa models on 29M words of English child-directed speech.
We test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets.
These findings support the hypothesis that, rather than proceeding from better data, the child's learning algorithm is substantially more data-efficient than current language modeling techniques.
arXiv Detail & Related papers (2024-08-07T08:18:51Z)
- Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model.
To rectify issues introduced by the translated data, we further pre-train the models with a small dataset of synthesized high-quality Arabic stories.
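A minimal sketch of the translation step described above, assuming the facebook/nllb-200-3.3B checkpoint and the Hugging Face translation pipeline (the paper's exact setup may differ):

```python
# Illustrative sketch: translating a TinyStories-style sentence from English to Arabic
# with an NLLB checkpoint. Model id and language codes are assumptions, not verified details.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-3.3B",  # assumed checkpoint for the "open NLLB-3B MT model"
    src_lang="eng_Latn",
    tgt_lang="arb_Arab",
)

story = "Once upon a time, there was a little girl who loved to play in the garden."
print(translator(story, max_length=128)[0]["translation_text"])
```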
arXiv Detail & Related papers (2024-05-23T07:53:04Z)
- Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese [47.45957604683302]
Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English.
We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language.
Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic).
We show that their performance is only 3.56% poorer on downstream natural language understanding (NLU) tasks and 1.51% poorer on generative (NLG) tasks than LMs pre-trained on clean data.
arXiv Detail & Related papers (2024-03-20T14:41:01Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work across different datasets, with up to a 21% performance improvement over a random baseline.
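A minimal sketch of the general uncertainty-guided allocation idea: route examples where repeated LLM annotations disagree to human annotators. The entropy criterion and threshold below are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Illustrative sketch of uncertainty-guided work allocation between humans and an LLM.
# Assumption: entropy over repeated LLM labels serves as the uncertainty signal.
import math
from collections import Counter

def label_entropy(llm_labels):
    """Shannon entropy of the empirical label distribution from repeated LLM annotations."""
    counts = Counter(llm_labels)
    total = len(llm_labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def allocate(examples, llm_labels_per_example, threshold=0.8):
    """Send high-uncertainty examples to humans, keep confident ones for the LLM."""
    to_human, to_llm = [], []
    for example, labels in zip(examples, llm_labels_per_example):
        (to_human if label_entropy(labels) > threshold else to_llm).append(example)
    return to_human, to_llm

# Toy usage: three repeated LLM annotations per text.
examples = ["text A", "text B"]
labels = [["pos", "pos", "pos"], ["pos", "neg", "neg"]]
human_queue, llm_queue = allocate(examples, labels)
print(human_queue, llm_queue)
```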
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
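A toy sketch of such an iterative synthesize-train-extrapolate loop is given below; every helper here is a simplified stand-in invented for illustration, not the S3 implementation.

```python
# Rough sketch of an iterative "synthesize, train, extrapolate from errors" loop in the
# spirit of S3. All helpers are simplified stand-ins, not the paper's actual method.
import random

def llm_synthesize(task_description, error_examples=None, n=8):
    """Stand-in for querying an LLM for labeled examples (optionally conditioned on errors)."""
    tag = "error-guided" if error_examples else "seed"
    return [(f"{tag} example {i}", random.choice(["pos", "neg"])) for i in range(n)]

def train_small_model(dataset):
    """Stand-in for fitting a small model; here it just memorizes the majority label."""
    labels = [y for _, y in dataset]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def find_errors(model, validation_set):
    """Return validation examples the small model gets wrong."""
    return [(x, y) for x, y in validation_set if model(x) != y]

validation_set = [("real example 1", "pos"), ("real example 2", "neg")]
dataset = llm_synthesize("sentiment classification")   # step 1: seed synthesis
for _ in range(3):                                      # later steps: extrapolate from errors
    model = train_small_model(dataset)
    errors = find_errors(model, validation_set)
    if not errors:
        break
    dataset += llm_synthesize("sentiment classification", error_examples=errors)
```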
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus [32.51325830633226]
We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus.
This shared task is intended for participants with an interest in small scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling.
arXiv Detail & Related papers (2023-01-27T15:52:50Z)