DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models
- URL: http://arxiv.org/abs/2410.05639v1
- Date: Tue, 8 Oct 2024 02:42:56 GMT
- Title: DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models
- Authors: Ranchi Zhao, Zhen Leng Thai, Yifan Zhang, Shengding Hu, Yunqi Ba, Jie Zhou, Jie Cai, Zhiyuan Liu, Maosong Sun,
- Abstract summary: We introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing.
We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM.
Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
- Score: 78.51470038301436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of Large Language Models (LLMs) is substantially influenced by the pretraining corpus, which consists of vast quantities of unsupervised data processed by the models. Despite its critical role in model performance, ensuring the quality of this data is challenging due to its sheer volume and the absence of sample-level quality annotations and enhancements. In this paper, we introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing. Specifically, DecorateLM rates texts against quality criteria, tags texts with hierarchical labels, and edits texts into a more formalized format. Due to the massive size of the pretraining corpus, adopting an LLM for decorating the entire corpus is less efficient. Therefore, to balance performance with efficiency, we curate a meticulously annotated training corpus for DecorateLM using a large language model and distill data engineering expertise into a compact 1.2 billion parameter small language model (SLM). We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM. Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
Related papers
- Experience of Training a 1.7B-Parameter LLaMa Model From Scratch [10.39475177812483]
We share insights gained from training DMaS-LLaMa-Lite on approximately 20 billion tokens of data.
We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output.
By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies.
arXiv Detail & Related papers (2024-12-17T21:15:52Z) - Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z) - How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes [2.0109318570325847]
We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector.
We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET.
Our findings reveal improvement in translation performance with larger datasets across all metrics.
arXiv Detail & Related papers (2024-09-05T12:06:38Z) - Self-training Large Language Models through Knowledge Detection [26.831873737733737]
Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across downstream tasks.
This paper explores a self-training paradigm, where the LLM autonomously curates its own labels and selectively trains on unknown data samples.
Empirical evaluations demonstrate significant improvements in reducing hallucination in generation across multiple subjects.
arXiv Detail & Related papers (2024-06-17T07:25:09Z) - SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking [56.93151679231602]
This research identifies two key stylistic elements in responses: linguistic form and instructional surprisal.
We introduce Style Consistency-Aware Response Ranking (SCAR), which automatically prioritizes instruction-response pairs in the training set based on their response stylistic consistency.
arXiv Detail & Related papers (2024-06-16T10:10:37Z) - Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training exhausts an ever growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z) - Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z) - ZeroGen$^+$: Self-Guided High-Quality Data Generation in Efficient
Zero-Shot Learning [97.2907428983142]
ZeroGen attempts to purely use PLM to generate data and train a tiny model without relying on task-specific annotation.
We propose a noise-robust bi-level re-weighting framework which is able to learn the per-sample weights measuring the data quality without requiring any gold data.
arXiv Detail & Related papers (2022-05-25T11:38:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.