GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data
- URL: http://arxiv.org/abs/2410.02755v3
- Date: Fri, 31 Jan 2025 18:21:59 GMT
- Title: GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data
- Authors: Jifan Zhang, Ziyue Luo, Jia Liu, Ness Shroff, Robert Nowak
- Abstract summary: GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale.
We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost.
- Score: 12.13180744190893
- Abstract: Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high quality data for general or specialized domains from web-scale corpora -- a valuable capability given the current scarcity of high-quality domain-specific datasets. Extensive experiments using automatic and human evaluation metrics show that SIEVE and GPT-4o achieve similar performance on five highly specific filtering prompts. In addition, when performing quality filtering on web crawl datasets, we demonstrate SIEVE can further improve over state-of-the-art quality filtering methods in the DataComp-LM challenge for selecting LLM pretraining data.
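The active-learning loop the abstract describes can be sketched in miniature: an expensive oracle labels only the snippets the cheap classifier is least certain about, the classifier is refit, and the rest of the pool is filtered for free. Everything here is an illustrative assumption, not the paper's implementation: the word-count "quality" signal, the `TinyClassifier` threshold model, and the uncertainty-sampling rule are all stand-ins.

```python
import math

def expensive_oracle(text):
    # Stand-in for a GPT-4o quality-filter call (hypothetical):
    # here, "high quality" simply means the snippet is long enough.
    return len(text.split()) >= 5

class TinyClassifier:
    """Toy lightweight filter: a sigmoid over word count, with the
    decision threshold tuned from oracle-labelled examples."""
    def __init__(self):
        self.threshold = 0.0

    def predict_proba(self, text):
        # Confidence grows with distance from the threshold.
        n = len(text.split())
        return 1.0 / (1.0 + math.exp(-(n - self.threshold)))

    def fit(self, labelled):
        # Pick a threshold halfway between the classes, once both exist.
        positives = [len(t.split()) for t, y in labelled if y]
        negatives = [len(t.split()) for t, y in labelled if not y]
        if positives and negatives:
            self.threshold = (min(positives) + max(negatives)) / 2

def sieve_like_filter(pool, budget):
    """Spend `budget` oracle calls on the most uncertain snippets,
    then filter the remaining pool with the cheap classifier."""
    clf = TinyClassifier()
    labelled = []
    for _ in range(budget):
        # Uncertainty sampling: probability closest to 0.5.
        candidate = min(pool, key=lambda t: abs(clf.predict_proba(t) - 0.5))
        pool = [t for t in pool if t is not candidate]
        labelled.append((candidate, expensive_oracle(candidate)))
        clf.fit(labelled)
    kept = [t for t in pool if clf.predict_proba(t) > 0.5]
    return kept, clf
```

The cost asymmetry mirrors the paper's framing: the oracle runs `budget` times regardless of pool size, while every other document is scored by the cheap model.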
Related papers
- Multi-stage Large Language Model Pipelines Can Outperform GPT-4o in Relevance Assessment [6.947361774195549]
We propose a modular classification pipeline that divides the relevance assessment task into multiple stages.
One of our approaches showed an 18.4% increase in Krippendorff's alpha over OpenAI's GPT-4o mini.
arXiv Detail & Related papers (2025-01-24T07:33:39Z)
- FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering [2.0140381995251713]
This paper introduces an LLM-based line-level filtering method to enhance training data quality.
We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines.
To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets.
arXiv Detail & Related papers (2025-01-13T13:26:50Z)
- Phi-4 Technical Report [72.06109095293243]
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality.
Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process.
arXiv Detail & Related papers (2024-12-12T03:37:41Z)
- Leveraging Web-Crawled Data for High-Quality Fine-Tuning [24.19939701706869]
We argue that web-crawled data can still serve as a valuable source for high-quality supervised fine-tuning without relying on advanced models like GPT-4.
We create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data.
Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% in Chinese math problems.
arXiv Detail & Related papers (2024-08-15T08:12:52Z)
- Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
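The second stage described above, selecting the most challenging samples while penalizing similar ones, can be illustrated with an MMR-style greedy rule. This is an assumed sketch, not the paper's exact algorithm: the cosine-similarity penalty and the `(id, difficulty, feature_vector)` input format are both stand-ins for whatever the trained score net and VLM features actually provide.

```python
import math

def diverse_select(samples, k, sim_penalty=0.5):
    """Greedily take the highest-difficulty sample, then reduce the
    scores of samples similar to it so the final set stays diverse.
    `samples` is a list of (id, difficulty, feature_vector) triples."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    remaining = {sid: (score, vec) for sid, score, vec in samples}
    chosen = []
    while remaining and len(chosen) < k:
        # Take the currently hardest sample.
        sid = max(remaining, key=lambda s: remaining[s][0])
        _, vec = remaining.pop(sid)
        chosen.append(sid)
        # Penalize everything similar to the sample just chosen.
        for other, (score, ovec) in remaining.items():
            remaining[other] = (score - sim_penalty * cosine(vec, ovec), ovec)
    return chosen
```

With the penalty, a near-duplicate of an already-chosen sample loses score and a dissimilar but slightly easier sample can win the next slot, which is the diversity effect the summary describes.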
arXiv Detail & Related papers (2024-02-19T20:08:48Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
- InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 [14.248735997950446]
We introduce InstructionGPT-4, which is fine-tuned on a small dataset comprising only 200 examples.
Based on these metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data.
Our findings demonstrate that a small amount of high-quality instruction tuning data is sufficient to enable multimodal large language models to generate better output.
arXiv Detail & Related papers (2023-08-23T11:27:30Z) - Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.