Rephrasing the Web: A Recipe for Compute and Data-Efficient Language
Modeling
- URL: http://arxiv.org/abs/2401.16380v1
- Date: Mon, 29 Jan 2024 18:19:08 GMT
- Title: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language
Modeling
- Authors: Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang,
Navdeep Jaitly
- Abstract summary: We propose Web Rephrase Augmented Pre-training ($\textbf{WRAP}$) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web.
We show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by $\sim3x$.
At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%.
- Score: 27.975832264345772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are trained on massive scrapes of the web, which are
often unstructured, noisy, and poorly phrased. Current scaling laws show that
learning from such data requires an abundance of both compute and data, which
grows with the size of the model being trained. This is infeasible both because
of the large compute costs and duration associated with pre-training, and the
impending scarcity of high-quality data on the web. In this work, we propose
Web Rephrase Augmented Pre-training ($\textbf{WRAP}$) that uses an
off-the-shelf instruction-tuned model prompted to paraphrase documents on the
web in specific styles such as "like Wikipedia" or in "question-answer format"
to jointly pre-train LLMs on real and synthetic rephrases. First, we show that
using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training
by $\sim3x$. At the same pre-training compute budget, it improves perplexity by
more than 10% on average across different subsets of the Pile, and improves
zero-shot question answer accuracy across 13 tasks by more than 2%. Second, we
investigate the impact of the re-phrasing style on the performance of the
model, offering insights into how the composition of the training data can
impact the performance of LLMs in OOD settings. Our gains are attributed to the
fact that re-phrased synthetic data has higher utility than just real data
because it (i) incorporates style diversity that closely reflects downstream
evaluation style, and (ii) has higher 'quality' than web-scraped data.
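The WRAP recipe described above can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the exact style-prompt wording, the `rephrase`/`wrap_mixture` helpers, and the real-to-synthetic mixing ratio are not taken from the paper, and the placeholder `generate` callable merely stands in for a real instruction-tuned model.

```python
import random

# Style prompts in the spirit of WRAP ("like Wikipedia", "question-answer
# format"); the exact wording is an assumption, not the paper's prompts.
STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in a clear, encyclopedic style like Wikipedia:",
    "qa": "Convert the following text into a question-answer format:",
}

def build_prompt(style: str, document: str) -> str:
    """Compose the instruction given to an off-the-shelf instruction-tuned model."""
    return f"{STYLE_PROMPTS[style]}\n\n{document}"

def rephrase(document: str, style: str, generate=None) -> str:
    """Produce one synthetic rephrase of a web document.

    `generate` would normally call an instruction-tuned LLM; a trivial
    stand-in is used so the sketch runs without model weights.
    """
    if generate is None:
        generate = lambda prompt: prompt.splitlines()[-1]  # placeholder "model"
    return generate(build_prompt(style, document))

def wrap_mixture(real_docs, style="wikipedia", synthetic_ratio=0.5, seed=0):
    """Jointly mix real documents with their synthetic rephrases for pre-training."""
    rng = random.Random(seed)
    mixture = list(real_docs)
    n_synth = int(len(real_docs) * synthetic_ratio / (1 - synthetic_ratio))
    for doc in rng.sample(real_docs, min(n_synth, len(real_docs))):
        mixture.append(rephrase(doc, style))
    rng.shuffle(mixture)
    return mixture
```

With `synthetic_ratio=0.5`, each training batch would draw equally from the noisy real corpus and the style-rephrased synthetic corpus; the abstract attributes the gains to exactly this joint training on both sources.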
Related papers
- SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking [56.93151679231602]
This research decomposes response style into presentation and composition styles.
We introduce Style Consistency-Aware Response Ranking (SCAR).
SCAR prioritizes instruction-response pairs in the training set based on their response stylistic consistency.
arXiv Detail & Related papers (2024-06-16T10:10:37Z)
- LMEraser: Large Model Unlearning through Adaptive Prompt Tuning [21.141664917477257]
LMEraser takes a divide-and-conquer strategy with a prompt tuning architecture to isolate data influence.
Experiments demonstrate that LMEraser achieves a $100$-fold reduction in unlearning costs without compromising accuracy.
arXiv Detail & Related papers (2024-04-17T04:08:38Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
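As a rough illustration of data selection via embeddings, the sketch below greedily drops documents whose embedding is too similar to one already kept. The greedy pass, the cosine metric, and the 0.95 threshold are assumptions for illustration; the paper's actual procedure over pre-trained model embeddings may differ substantially.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def deduplicate(embeddings, threshold=0.95):
    """Greedy de-duplication: keep a document only if it is not too
    similar to any document already kept. Returns kept indices."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

At corpus scale a pairwise pass like this would be replaced by approximate nearest-neighbor search, but the principle, measuring semantic redundancy in embedding space rather than by exact string match, is the same.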
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Meta-Learning Online Adaptation of Language Models [88.8947656843812]
Large language models encode impressively broad world knowledge in their parameters.
However, the knowledge in static language models falls out of date, limiting the model's effective "shelf life".
arXiv Detail & Related papers (2023-05-24T11:56:20Z)
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
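A toy sketch of unstructured weight sparsity is shown below: a fixed fraction of individual weights is zeroed with no block or row pattern. Using a random mask is an illustrative assumption; actual sparse pre-training selects and maintains masks during training, and the FLOP savings require sparsity-aware kernels or hardware.

```python
import random

def sparsify(weights, sparsity=0.75, seed=0):
    """Zero out a random `sparsity` fraction of a flat weight list
    (unstructured sparsity). In sparse pre-training, only the surviving
    weights are trained; dense fine-tuning would later re-enable the rest."""
    rng = random.Random(seed)
    n_zero = int(len(weights) * sparsity)
    zero_idx = set(rng.sample(range(len(weights)), n_zero))
    return [0.0 if i in zero_idx else w for i, w in enumerate(weights)]
```

At 75% sparsity only a quarter of the weights receive updates, which is the source of the reduced pre-training FLOPs the summary cites.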
arXiv Detail & Related papers (2023-03-18T17:56:01Z) - DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and
Training Efficiency via Efficient Data Sampling and Routing [57.86954315102865]
DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost, while still maintaining 95% of model quality compared to baseline with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under same data/time/cost.
arXiv Detail & Related papers (2022-12-07T12:27:28Z) - Cross-Modal Adapter for Text-Video Retrieval [91.9575196703281]
We present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adjust the pre-trained model with only a few parameterized layers.
It achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
arXiv Detail & Related papers (2022-11-17T16:15:30Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.