How to Train Data-Efficient LLMs
- URL: http://arxiv.org/abs/2402.09668v1
- Date: Thu, 15 Feb 2024 02:27:57 GMT
- Title: How to Train Data-Efficient LLMs
- Authors: Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan
Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng
- Abstract summary: We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
- Score: 56.41105687693619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The training of large language models (LLMs) is expensive. In this paper, we
study data-efficient approaches for pre-training LLMs, i.e., techniques that
aim to optimize the Pareto frontier of model quality and training resource/data
consumption. We seek to understand the tradeoffs associated with data selection
routines based on (i) expensive-to-compute data-quality estimates, and (ii)
maximization of coverage and diversity-based measures in the feature space. Our
first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of
instruction-tuned LLMs to directly assess the quality of a training example. To
target coverage, we propose Density sampling, which models the data
distribution to select a diverse sample. In our comparison of 19 samplers,
involving hundreds of evaluation tasks and pre-training runs, we find that
Ask-LLM and Density are the best methods in their respective categories.
Coverage sampling can recover the performance of the full data, while models
trained on Ask-LLM data consistently outperform full-data training, even when
we reject 90% of the original dataset, and they converge up to 70% faster.
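To make the two samplers described above concrete, here is a minimal sketch of Ask-LLM-style quality scoring: an instruction-tuned proxy model is asked whether an example is worth training on, and the probability it assigns to answering "yes" serves as the quality score. The model name, prompt wording, and top-fraction selection policy below are illustrative stand-ins rather than the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative proxy scorer; the paper's exact prompt and scoring LLMs differ.
MODEL_NAME = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative content that could help "
    "train a large language model? Answer with yes or no."
)

def ask_llm_score(example: str) -> float:
    """Quality score = probability the proxy model assigns to answering 'yes'."""
    inputs = tokenizer(PROMPT.format(example=example), return_tensors="pt", truncation=True)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits, dim=-1)[yes_id].item()

# Keep the highest-scoring fraction of the corpus (the budget is a free parameter).
corpus = ["The Riemann zeta function extends analytically to the complex plane.",
          "click here click here click here"]
kept = sorted(corpus, key=ask_llm_score, reverse=True)[: max(1, len(corpus) // 10)]
```

And a rough sketch of coverage-oriented selection in the spirit of Density sampling: estimate the density of each example in a feature space and sample with probability inversely proportional to that density, so that sparse regions of the data distribution are covered. TF-IDF features and scikit-learn's KernelDensity are assumed stand-ins for the pre-trained embeddings and kernel-sum estimator used in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KernelDensity

def density_sample(corpus: list[str], budget: int, seed: int = 0) -> list[str]:
    """Select `budget` examples, favoring low-density (rare) regions of feature space."""
    feats = TfidfVectorizer(max_features=256).fit_transform(corpus).toarray()
    log_density = KernelDensity(bandwidth=1.0).fit(feats).score_samples(feats)
    neg = -log_density
    neg -= neg.max()                      # numerical stability before exponentiating
    weights = np.exp(neg)
    weights /= weights.sum()              # sampling probability proportional to 1 / estimated density
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(corpus), size=min(budget, len(corpus)), replace=False, p=weights)
    return [corpus[i] for i in idx]
```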
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrades training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining [40.21546440726592]
We propose a novel multi-agent collaborative data selection mechanism for large language model (LLM) pretraining.
In this framework, each data selection method serves as an independent agent, and an agent console is designed to dynamically integrate the information from all agents.
arXiv Detail & Related papers (2024-10-10T16:45:28Z)
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
arXiv Detail & Related papers (2024-09-09T17:23:29Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of the training data, and that data with a lower compression ratio usually yields a lower training loss.
Based on this entropy law, we propose an efficient and universal data selection method.
We also present an application of the entropy law for detecting potential performance risks at the beginning of model training (a minimal compression-ratio sketch follows this list).
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
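As a companion to the Entropy Law entry above, the snippet below shows one way to measure a corpus-level compression ratio with zlib. It is only a hypothetical illustration of the quantity that the paper correlates with performance, not the paper's actual selection algorithm.

```python
import zlib

def compression_ratio(corpus: list[str]) -> float:
    """Compressed bytes divided by raw bytes; a redundant corpus compresses to a smaller fraction."""
    raw = "\n".join(corpus).encode("utf-8")
    return len(zlib.compress(raw, level=9)) / max(len(raw), 1)

# A highly repetitive corpus compresses far better than a varied one.
redundant = ["the cat sat on the mat"] * 200
varied = [f"document {i}: sample text about topic {i % 7} with id {i * 31}" for i in range(200)]
print(compression_ratio(redundant), compression_ratio(varied))  # redundant < varied
```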