Data-Juicer: A One-Stop Data Processing System for Large Language Models
- URL: http://arxiv.org/abs/2309.02033v3
- Date: Wed, 20 Dec 2023 08:27:40 GMT
- Title: Data-Juicer: A One-Stop Data Processing System for Large Language Models
- Authors: Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge,
Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding,
Jingren Zhou
- Abstract summary: A data recipe is a mixture of data from different sources for training Large Language Models (LLMs).
We build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes.
The data recipes derived with Data-Juicer gain notable improvements on state-of-the-art LLMs.
- Score: 73.27731037450995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The immense evolution in Large Language Models (LLMs) has underscored the
importance of massive, heterogeneous, and high-quality data. A data recipe is a
mixture of data from different sources for training LLMs, which plays a vital
role in LLMs' performance. Existing open-source tools for LLM data processing
are mostly tailored for specific data recipes. To continuously uncover the
potential of LLMs, incorporate data from new sources, and improve LLMs'
performance, we build a new system named Data-Juicer, with which we can
efficiently generate diverse data recipes, explore different possibilities in
forming data mixtures, and evaluate their effects on model performance.
Different from traditional data-analytics pipelines, Data-Juicer faces some
unique challenges. Firstly, the possible data sources for forming data recipes
are truly heterogeneous and massive with various qualities. Secondly, it is
extremely expensive to precisely evaluate data recipes' impact on LLMs'
performance. Thirdly, the end users of Data-Juicer, model developers, need
sufficient flexibility to configure and evaluate different data recipes.
Data-Juicer features a fine-grained abstraction of pipelines for constructing
data recipes, with over 50 built-in operators for easy composition and
extension. By incorporating visualization and auto-evaluation capabilities,
Data-Juicer enables a timely feedback loop for both LLM pre-training and
fine-tuning. Further, Data-Juicer is optimized and integrated with ecosystems
for LLM training, evaluation, and distributed computing. The data recipes
derived with Data-Juicer gain notable improvements on state-of-the-art LLMs, by
up to 7.45% increase in averaged score across 16 LLM benchmarks and 17.5%
higher win rate in pair-wise GPT-4 evaluations. Our system, data recipes, and
tutorials are released, calling for broader data-centric research on training
and understanding LLMs.
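The abstract describes a fine-grained operator abstraction in which 50+ built-in operators are composed into a data recipe. The following is a minimal sketch of that idea under stated assumptions: the class and function names (`Mapper`, `Filter`, `run_recipe`) and the toy operators are hypothetical illustrations, not Data-Juicer's actual API.

```python
# Sketch of an operator-composable data-recipe pipeline in the spirit of the
# abstraction described above. All names here are illustrative, not the
# real Data-Juicer interface.

from dataclasses import dataclass
from typing import Callable, Iterable

Sample = dict  # e.g. {"text": "...", "source": "web"}

@dataclass
class Mapper:
    """Transforms each sample, e.g. text normalization."""
    fn: Callable[[Sample], Sample]
    def __call__(self, samples: Iterable[Sample]):
        return (self.fn(s) for s in samples)

@dataclass
class Filter:
    """Keeps only samples passing a predicate, e.g. a length check."""
    keep: Callable[[Sample], bool]
    def __call__(self, samples: Iterable[Sample]):
        return (s for s in samples if self.keep(s))

def run_recipe(samples, ops):
    """Apply operators in order; each consumes and yields a sample stream."""
    for op in ops:
        samples = op(samples)
    return list(samples)

# A toy "recipe": collapse whitespace, then drop very short texts.
recipe = [
    Mapper(lambda s: {**s, "text": " ".join(s["text"].split())}),
    Filter(lambda s: len(s["text"]) >= 10),
]

corpus = [
    {"text": "  a   short doc  ", "source": "web"},
    {"text": "hi", "source": "chat"},
    {"text": "a sufficiently long document about LLM training data", "source": "books"},
]
cleaned = run_recipe(corpus, recipe)
```

Because every operator shares one stream-in/stream-out interface, recipes can be mixed, reordered, and extended freely, which is the flexibility the abstract attributes to the system's operator design.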
Related papers
- Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning [56.795078085234195]
LLM pruning approaches universally rely on the C4 dataset as the calibration data for calculating pruning scores.
In this study, we evaluate the choice of calibration data on LLM pruning, across a wide range of datasets.
Our results also uncover several subtle and often unexpected findings.
arXiv: 2024-10-09
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of the training data, and a lower compression ratio usually yields a lower training loss.
Based on the entropy law, we propose an efficient and broadly applicable data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv: 2024-07-09
- GOLD: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation [21.56082253577229]
GOLD is a task-agnostic data generation and knowledge distillation framework.
It employs an iterative out-of-distribution-guided feedback mechanism for the LLM.
An energy-based OOD evaluation approach is also introduced to deal with noisy generated data.
arXiv: 2024-03-28
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv: 2024-02-15
- SEED: Domain-Specific Data Curation With Large Language Models [22.54280367957015]
We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs).
SEED automatically selects from its four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand.
arXiv: 2023-10-01
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
To target those weaknesses, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we use GPT-4 to generate high-quality data for each given data type.
arXiv: 2023-08-25
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv: 2023-08-23
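The IFD metric above is commonly described as comparing a model's loss on a response when conditioned on its instruction against its loss on the response alone. The sketch below assumes that formulation; `toy_loss` is a hypothetical stand-in for a real LLM's token-level cross-entropy, and `ifd_score` is an illustrative name, not the paper's code.

```python
# Sketch of an IFD-style score: the ratio of the model's loss on a response
# conditioned on its instruction to the loss on the response alone. A ratio
# near 1 suggests the instruction gave the model little help, flagging
# harder, more informative samples. `toy_loss` is a stand-in: a real
# implementation would use an LLM's per-token cross-entropy.

def toy_loss(text: str, context: str = "") -> float:
    """Stand-in for average per-token loss of `text` given `context`.
    Here, words already mentioned in the context are cheaper."""
    words = text.split()
    if not words:
        return 0.0
    ctx = set(context.split())
    costs = [0.5 if w in ctx else 2.0 for w in words]
    return sum(costs) / len(costs)

def ifd_score(instruction: str, response: str) -> float:
    conditioned = toy_loss(response, context=instruction)
    direct = toy_loss(response)
    return conditioned / direct if direct > 0 else float("inf")

# A response that echoes its instruction scores low (the instruction helped);
# an unrelated response stays near 1.0 (the instruction did not help).
easy = ifd_score("list three colors", "three colors red green blue")
hard = ifd_score("list three colors", "photosynthesis converts light")
```

Ranking samples by this ratio and keeping the higher-scoring ("cherry") ones is the selection idea the summary describes.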
This list is automatically generated from the titles and abstracts of the papers in this site.