FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
- URL: http://arxiv.org/abs/2501.07314v1
- Date: Mon, 13 Jan 2025 13:26:50 GMT
- Title: FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
- Authors: Erik Henriksson, Otto Tarkka, Filip Ginter
- Abstract summary: This paper introduces an LLM-based line-level filtering method to enhance training data quality.
We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines.
To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets.
- Abstract: Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
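As a rough illustration of the method described in the abstract, the sketch below applies a line-level quality classifier to a document at inference time. It assumes a DeBERTa-v3-style classifier fine-tuned on the LLM-generated line labels and the Hugging Face transformers library; the checkpoint path, label name, and threshold are illustrative placeholders, not the released FinerWeb-10BT artifacts.

```python
# Minimal sketch of line-level filtering at inference time, assuming a
# DeBERTa-v3-style line-quality classifier trained on LLM-generated labels.
# The checkpoint path, label name, and threshold are placeholders, not the
# released FinerWeb-10BT artifacts.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/deberta-v3-line-quality",  # hypothetical checkpoint
)

def filter_document(text: str, keep_label: str = "Clean", threshold: float = 0.5) -> str:
    """Keep only the lines the classifier predicts as high quality."""
    lines = [line for line in text.split("\n") if line.strip()]
    preds = classifier(lines, truncation=True)
    kept = [
        line
        for line, pred in zip(lines, preds)
        if pred["label"] == keep_label and pred["score"] >= threshold
    ]
    return "\n".join(kept)

doc = "An informative paragraph of body text.\nClick here to subscribe!\nAnother useful sentence."
print(filter_document(doc))
```

In the paper's pipeline the classifier is trained on line labels produced by GPT-4o mini and then used to filter a 10B-token FineWeb subset; the snippet above only shows the final filtering step under those assumptions.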
Related papers
- GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data [12.13180744190893]
GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale.
We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost.
arXiv Detail & Related papers (2024-10-03T17:58:29Z)
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data (a rough scoring sketch follows this list).
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to capture semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
- Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training consumes an ever-growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z)
- A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking [79.35822270532948]
Cross-encoders distilled from large language models (LLMs) are often more effective re-rankers than cross-encoders fine-tuned on manually labeled data.
We construct and release a new distillation dataset: Rank-DistiLLM.
arXiv Detail & Related papers (2024-05-13T16:51:53Z)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
- Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters [38.41887207958015]
We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs).
Our filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore.
arXiv Detail & Related papers (2024-03-05T06:05:15Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
- Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning [43.10197671420528]
We study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model?
This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model.
Not only does this greatly speed up data filtering, but the LLM finetuned on the filtered data also achieves better performance on standard benchmarks.
arXiv Detail & Related papers (2024-02-01T11:57:53Z)
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for the second step of dataset curation: filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
- Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach used in low-dimensional settings to LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
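As referenced in the ScalingFilter entry above, the sketch below scores text by the perplexity gap between a smaller and a larger reference model. The model checkpoints and the use of a raw perplexity difference are assumptions for illustration only; the paper's exact quality-factor formulation may differ.

```python
# Sketch of perplexity-difference quality scoring in the spirit of ScalingFilter.
# Model choices and the raw perplexity gap are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under `model` (teacher-forced cross-entropy)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Two reference models of different capacity (placeholder checkpoints).
small_tok = AutoTokenizer.from_pretrained("gpt2")
small_lm = AutoModelForCausalLM.from_pretrained("gpt2")
large_tok = AutoTokenizer.from_pretrained("gpt2-medium")
large_lm = AutoModelForCausalLM.from_pretrained("gpt2-medium")

def quality_score(text: str) -> float:
    """Higher when the larger model fits the text much better than the
    smaller one, which this sketch treats as a sign of quality."""
    return perplexity(small_lm, small_tok, text) - perplexity(large_lm, large_tok, text)

print(quality_score("A well-formed, informative sentence about language models."))
```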
This list is automatically generated from the titles and abstracts of the papers on this site.