The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
- URL: http://arxiv.org/abs/2406.17557v2
- Date: Thu, 31 Oct 2024 11:37:49 GMT
- Title: The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
- Authors: Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf
- Abstract summary: FineWeb is a 15-trillion token dataset derived from 96 Common Crawl snapshots.
FineWeb-Edu is a 1.3-trillion token collection of educational text filtered from FineWeb.
- Score: 30.955171096569618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
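The released datasets are straightforward to sample even without reproducing the paper's full curation pipeline. Below is a minimal sketch, not the authors' implementation: it streams a handful of documents with the Hugging Face datasets library and applies a toy word-shingle near-duplicate check in the spirit of the deduplication ablations mentioned in the abstract. The repository id "HuggingFaceFW/fineweb", the "sample-10BT" config, the "text" field, the shingle size, and the 0.8 similarity threshold are assumptions about the public release, not details taken from the abstract.

# Toy near-duplicate filter over a small streamed sample of FineWeb.
# Dataset/config names below are assumptions about the public release.
from datasets import load_dataset


def shingles(text: str, n: int = 5) -> set:
    """Word n-grams used for a crude Jaccard comparison (n=5 is illustrative)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / max(len(a | b), 1)


# Stream the dataset so nothing close to 15T tokens is downloaded.
stream = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

seen, kept = [], []
for i, row in enumerate(stream):
    if i >= 100:  # inspect only a small prefix of the stream
        break
    sig = shingles(row["text"])
    if all(jaccard(sig, s) < 0.8 for s in seen):  # 0.8: illustrative threshold
        seen.append(sig)
        kept.append(row["text"])

print(f"kept {len(kept)} of 100 streamed documents after the toy dedup pass")

A quadratic pairwise comparison like this is only for illustration; the paper's ablations rely on approximate, snapshot-level deduplication (e.g., MinHash) and model-based quality filters, documented in the publicly released curation codebase mentioned in the abstract.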
Related papers
- Craw4LLM: Efficient Web Crawling for LLM Pretraining [45.92222494772196]
Craw4LLM is an efficient web crawling method that explores the web graph based on the preference of LLM pretraining.
Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data.
arXiv Detail & Related papers (2025-02-19T00:31:43Z) - GneissWeb: Preparing High Quality Data for LLMs at Scale [15.596915267015797]
We introduce GneissWeb, a large dataset yielding around 10 trillion tokens.
GneissWeb achieves a favorable trade-off between data quality and quantity.
We show that models trained using GneissWeb dataset outperform those trained on FineWeb-V1.1.0.
arXiv Detail & Related papers (2025-02-19T00:14:29Z) - Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training [22.53813258871828]
We investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments.
We find that neural networks can identify which dataset a single text sequence belongs to with surprising accuracy, significantly better than humans can.
arXiv Detail & Related papers (2024-12-03T21:43:58Z) - Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation [98.92677830223786]
This work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective.
We propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data.
Our proposed method achieves performance comparable to or even superior to baselines trained with many more samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
arXiv Detail & Related papers (2024-09-09T17:23:29Z) - Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z) - Zyda: A 1.3T Dataset for Open Language Modeling [10.973515151563427]
Zyda is a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus.
Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite.
arXiv Detail & Related papers (2024-06-04T05:47:17Z) - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z) - The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only [48.498376125522114]
We show that properly filtered and deduplicated web data alone can lead to powerful models.
We release an extract of 600 billion tokens from our RefinedWeb dataset, along with 1.3B- and 7.5B-parameter language models trained on it.
arXiv Detail & Related papers (2023-06-01T20:03:56Z) - Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z) - Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP [43.7219097444333]
We introduce a testbed of six publicly available data sources to investigate how pre-training distributions induce robustness in CLIP.
We find that the performance of the pre-training data varies substantially across distribution shifts.
We find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source.
arXiv Detail & Related papers (2022-08-10T18:24:23Z) - Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach [115.91099791629104]
We construct two new benchmark webly supervised fine-grained datasets, WebFG-496 and WebiNat-5089.
WebiNat-5089 contains 5089 sub-categories and more than 1.1 million web training images, making it the largest webly supervised fine-grained dataset to date.
As a minor contribution, we also propose a novel webly supervised method (termed "Peer-learning") for benchmarking these datasets.
arXiv Detail & Related papers (2021-08-05T06:28:32Z)