Large Language Model-guided Document Selection
- URL: http://arxiv.org/abs/2406.04638v1
- Date: Fri, 7 Jun 2024 04:52:46 GMT
- Title: Large Language Model-guided Document Selection
- Authors: Xiang Kong, Tom Gunter, Ruoming Pang,
- Abstract summary: Large Language Model (LLM) pre-training exhausts an ever growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
- Score: 23.673690115025913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al.,2023], we explore a promising direction for scalable general-domain document selection; employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied at scale to a large, and already heavily-filtered, web-crawl-derived corpus autonomously. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: 1. Filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs, 2. More capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt, 3. In-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.
Related papers
Err
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.