The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
with Web Data, and Web Data Only
- URL: http://arxiv.org/abs/2306.01116v1
- Date: Thu, 1 Jun 2023 20:03:56 GMT
- Title: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
with Web Data, and Web Data Only
- Authors: Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru,
Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei,
Julien Launay
- Abstract summary: We show that properly filtered and deduplicated web data alone can lead to powerful models.
We release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B-parameter language models trained on it.
- Score: 48.498376125522114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are commonly trained on a mixture of filtered web data
and curated high-quality corpora, such as social media conversations, books, or
technical papers. This curation process is believed to be necessary to produce
performant models with broad zero-shot generalization abilities. However, as
larger models requiring pretraining on trillions of tokens are considered, it
is unclear how scalable curation is and whether we will run out of unique
high-quality data soon. Contrary to previous beliefs, we show that properly
filtered and deduplicated web data alone can lead to powerful models, even
significantly outperforming state-of-the-art models trained on
The Pile. Despite extensive filtering, the high-quality data we extract from
the web is still plentiful, and we are able to obtain five trillion tokens from
CommonCrawl. We publicly release an extract of 600 billion tokens from our
RefinedWeb dataset, and 1.3/7.5B-parameter language models trained on it.
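The abstract names the two key operations, filtering and deduplication, but not the concrete rules, so the following is only a minimal Python sketch of that kind of pipeline. The heuristics, thresholds, shingle size, and toy corpus below are illustrative assumptions, not RefinedWeb's actual filtering or deduplication settings.

```python
# Minimal illustration of "filter, then deduplicate" for web text.
# All rules and thresholds here are assumptions for demonstration only.
import hashlib
import re

STOPWORDS = {"the", "and", "to", "of", "a", "in", "is", "that"}

def passes_quality_filters(text: str) -> bool:
    """Cheap document-level heuristics in the spirit of web-data filtering."""
    words = text.split()
    if not 50 <= len(words) <= 100_000:            # drop very short or huge pages
        return False
    if sum(w.isalpha() for w in words) / len(words) < 0.8:
        return False                               # mostly markup, numbers, spam
    if not STOPWORDS & {w.lower() for w in words}: # no common English words at all
        return False
    return True

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()

def shingles(text: str, n: int = 5) -> set:
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def dedupe(docs, jaccard_threshold: float = 0.8):
    """Exact dedup via content hash, then greedy near-dedup via shingle Jaccard."""
    seen_hashes, kept_docs, kept_shingles = set(), [], []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode()).hexdigest()
        if digest in seen_hashes:
            continue                               # exact duplicate
        sh = shingles(doc)
        if any(len(sh & other) / max(1, len(sh | other)) >= jaccard_threshold
               for other in kept_shingles):
            continue                               # near duplicate
        seen_hashes.add(digest)
        kept_docs.append(doc)
        kept_shingles.append(sh)
    return kept_docs

page = ("Common Crawl pages vary a lot in quality, so a pipeline like this "
        "filters out noisy documents and then removes duplicates. " * 8)
corpus = [page, "  " + page.upper(), "short spam!!!"]
kept = dedupe([d for d in corpus if passes_quality_filters(d)])
print(len(kept))  # 1: the shouted copy and the spam page are dropped
```

At the scale the abstract describes (trillions of tokens), an actual pipeline would replace the quadratic shingle comparison with approximate methods such as MinHash/LSH, but the overall shape of the computation is the same.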
Related papers
- Leveraging Web-Crawled Data for High-Quality Fine-Tuning [24.19939701706869]
We argue that web-crawled data can still serve as a valuable source for high-quality supervised fine-tuning without relying on advanced models like GPT-4.
We create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data.
Our experiments show that training on the model-transformed data yields better results, surpassing training on high-quality data alone by an average of 9.4% on Chinese math problems.
arXiv Detail & Related papers (2024-08-15T08:12:52Z)
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale [30.955171096569618]
FineWeb is a 15-trillion token dataset derived from 96 Common Crawl snapshots.
FineWeb-Edu is a 1.3-trillion token collection of educational text filtered from FineWeb.
arXiv Detail & Related papers (2024-06-25T13:50:56Z)
- Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z)
- The Falcon Series of Open Language Models [36.93493444130304]
We introduce the Falcon series: 7B, 40B, and 180B parameter causal decoder-only models trained on diverse, high-quality corpora.
The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text, the largest openly documented pretraining run.
Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1.
arXiv Detail & Related papers (2023-11-28T15:12:47Z)
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text [32.15651290548974]
We introduce OpenWebMath, an open dataset containing 14.7B tokens of mathematical webpages from Common Crawl.
We run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
arXiv Detail & Related papers (2023-10-10T16:57:28Z)
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models [69.96148259273065]
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z)
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data, but with access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
- Extracting Training Data from Large Language Models [78.3839333127544]
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z)
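The two training-data extraction papers above follow the same basic recipe: sample freely from the model, then test candidate generations for memorization. As a rough illustration only, the sketch below samples from the public GPT-2 checkpoint via Hugging Face transformers and flags generations that share a long verbatim n-gram with a reference corpus; the n-gram length, sampling settings, and overlap test are assumptions for demonstration, not the papers' actual attack.

```python
# Rough sketch of the extraction recipe summarized above: sample from a public
# model and flag outputs that reproduce long verbatim spans of known text.
# The reference corpus, n-gram length, and sampling settings are arbitrary
# choices for illustration, not the settings used in either paper.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def ngrams(text: str, n: int = 10) -> set:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def extract_candidates(reference_corpus, num_samples: int = 20):
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    reference = set()
    for doc in reference_corpus:
        reference |= ngrams(doc)

    flagged = []
    for _ in range(num_samples):
        # Unconditional top-k sampling, seeded only with the BOS token.
        inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
        output_ids = model.generate(
            **inputs,
            do_sample=True,
            top_k=40,
            max_new_tokens=128,
            pad_token_id=tokenizer.eos_token_id,
        )
        sample = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # A long verbatim n-gram shared with known text is a memorization signal.
        if ngrams(sample) & reference:
            flagged.append(sample)
    return flagged
```

The papers additionally rank candidates with membership-inference style metrics (for example, comparing the model's likelihood against a smaller model or a compression baseline) before verification, a step this sketch omits.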