FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
- URL: http://arxiv.org/abs/2502.00761v2
- Date: Tue, 18 Feb 2025 03:17:33 GMT
- Title: FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
- Authors: Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Jingang Wang, Xunliang Cai
- Abstract summary: We propose FIRE, a flexible framework for integrating multiple data quality raters. FIRE aligns multiple quality signals into a unified space and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Experiments on the SlimPajama dataset reveal that FIRE outperforms other data selection methods.
- Score: 13.182375437229519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting high-quality data can significantly improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques and single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Experiments on the SlimPajama dataset reveal that FIRE outperforms other data selection methods and significantly enhances the pretrained model across a wide range of downstream tasks, achieving a 2.9% average performance improvement over random selection and more than halving the FLOPs needed to reach a given performance level.
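As a concrete illustration of the recipe in the abstract, here is a minimal Python sketch; the rank-based alignment, uniform rater weighting, and fixed keep-fraction schedule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def align(scores: np.ndarray) -> np.ndarray:
    # Map one rater's raw scores to empirical quantiles in [0, 1] so that
    # raters with different scales and distributions become comparable.
    ranks = scores.argsort().argsort()
    return ranks / max(len(scores) - 1, 1)

def integrate(score_matrix: np.ndarray, weights=None) -> np.ndarray:
    # score_matrix has one row per rater, one column per data point.
    aligned = np.stack([align(row) for row in score_matrix])
    w = np.full(len(aligned), 1 / len(aligned)) if weights is None else np.asarray(weights)
    return w @ aligned  # unified quality signal per data point

def progressive_select(score_matrix: np.ndarray, keep_frac=0.5, rounds=3):
    # Iteratively re-integrate over the surviving pool and keep the top slice.
    idx = np.arange(score_matrix.shape[1])
    for _ in range(rounds):
        unified = integrate(score_matrix[:, idx])
        top = np.argsort(-unified)[: max(1, int(len(idx) * keep_frac))]
        idx = idx[top]
    return idx
```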
Related papers
- QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining [12.872792775510172]
We introduce a unified data selection framework called QuaDMix, which automatically optimizes the data distribution for large language model pretraining.
Our experiments across diverse models and datasets demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks.
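The summary does not spell out QuaDMix's optimization procedure; the hedged Python sketch below only illustrates the underlying quality-diversity trade-off with a hand-set interpolation weight `lam`, whereas QuaDMix tunes such mixture parameters automatically.

```python
import numpy as np

def sample_mixture(quality, domains, lam, k, rng):
    # Interpolate between pure quality-weighted sampling (lam=1) and
    # uniform-over-domain sampling (lam=0); quality scores assumed positive.
    uniq, counts = np.unique(domains, return_counts=True)
    inv = dict(zip(uniq, 1.0 / counts))
    balance = np.array([inv[d] for d in domains])
    w = lam * quality / quality.sum() + (1 - lam) * balance / balance.sum()
    return rng.choice(len(quality), size=k, replace=False, p=w / w.sum())

rng = np.random.default_rng(0)
quality = rng.random(1000) + 1e-6
domains = rng.integers(0, 5, size=1000)
picked = sample_mixture(quality, domains, lam=0.7, k=100, rng=rng)
```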
arXiv Detail & Related papers (2025-04-23T08:36:50Z) - Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
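A minimal sketch of what such a dual-model gate could look like; `quality_judge` and `difficulty_judge` are hypothetical callables returning scores in [0, 1], and the thresholds are assumptions rather than the paper's settings.

```python
def dual_model_filter(samples, quality_judge, difficulty_judge,
                      q_min=0.6, d_band=(0.3, 0.9)):
    # Keep an instruction sample only if one judge rates its quality highly
    # AND the other places its difficulty in a useful band (hard enough to
    # teach something, not so hard it is likely noise).
    kept = []
    for s in samples:
        if quality_judge(s) >= q_min and d_band[0] <= difficulty_judge(s) <= d_band[1]:
            kept.append(s)
    return kept
```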
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - Data Quality Control in Federated Instruction-tuning of Large Language Models [43.29678396558287]
Federated Learning enables privacy-preserving collaborative instruction tuning of large language models.
Local clients lack global visibility to filter noisy or low-quality samples before training.
We propose FedDQC, a novel federated instruction tuning framework with dynamic data quality control.
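The summary leaves FedDQC's scoring rule unspecified; as one purely local stand-in that needs no global visibility, a client can drop its own highest-loss samples each round, as sketched below.

```python
import numpy as np

def local_filter(per_sample_losses: np.ndarray, keep_frac: float = 0.8) -> np.ndarray:
    # Purely local heuristic: treat the highest-loss tail of this client's
    # own data as likely noise and drop it. Recomputing each round makes the
    # filter dynamic as the shared model improves.
    cutoff = np.quantile(per_sample_losses, keep_frac)
    return np.where(per_sample_losses <= cutoff)[0]  # indices to train on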
arXiv Detail & Related papers (2024-10-15T12:14:57Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
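The framework goes beyond raw similarity filtering, but its core building block can be sketched with the public openai/clip-vit-base-patch32 checkpoint; inputs are PIL images, and the cutoff would be an assumption.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pair_scores(images, captions):
    # Score each (image, caption) pair by CLIP similarity; low scores flag
    # mislabeled or noisy pairs that a selector could drop.
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.diagonal()  # matched-pair similarities

# keep = [i for i, s in enumerate(pair_scores(imgs, caps)) if s > cutoff]
```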
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Synth-Empathy: Towards High-Quality Synthetic Empathy Data [23.891966228508476]
Synth-Empathy is a pipeline that automatically generates high-quality empathetic data while discarding low-quality data.
We show the trade-off between data quantity and quality, providing insights into empathetic data generation and selection.
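A generic generate-then-filter loop in this spirit, with `generate` and `score` as hypothetical LLM-backed callables; raising the threshold trades quantity for quality, matching the trade-off noted above.

```python
def generate_and_filter(seed_dialogues, generate, score, threshold=0.7, rounds=2):
    # Expand the pool with synthetic responses, but admit only those the
    # quality scorer accepts.
    pool = list(seed_dialogues)
    for _ in range(rounds):
        candidates = [generate(d) for d in pool]  # hypothetical LLM call
        pool.extend(c for c in candidates if score(c) >= threshold)
    return pool
```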
arXiv Detail & Related papers (2024-07-31T15:12:24Z) - AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at comprehensively enhancing big data quality.
First, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
We also present a generic framework for detecting various quality anomalies using AI models.
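A weighted scoring system of this kind reduces to a normalized weighted sum; the dimension names and weights below are illustrative assumptions.

```python
def data_quality_score(metrics: dict, weights: dict) -> float:
    # Weighted aggregate over normalized per-dimension metrics in [0, 1].
    total = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total

# Example with assumed dimensions and weights:
score = data_quality_score(
    {"completeness": 0.95, "consistency": 0.80, "timeliness": 0.60},
    {"completeness": 0.5, "consistency": 0.3, "timeliness": 0.2},
)  # -> 0.835
```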
arXiv Detail & Related papers (2024-05-06T21:36:45Z) - Enhancing Data Quality in Federated Fine-Tuning of Foundation Models [54.757324343062734]
We propose a data quality control pipeline for federated fine-tuning of foundation models.
This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard.
Our experiments show that the proposed quality control pipeline improves the effectiveness and reliability of model training, leading to better performance.
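One privacy-friendly way to realize a global threshold (not necessarily the paper's aggregation) is for clients to report only score quantiles, which the server averages into an approximate global distribution.

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 101)  # probability levels for reported quantiles

def client_summary(local_scores):
    # Each client shares only quantiles of its quality scores, not raw data.
    return np.quantile(local_scores, GRID)

def global_threshold(summaries, keep_frac=0.5):
    # Average the clients' quantile functions to approximate the global score
    # distribution, then read off one shared cutoff that keeps the top slice.
    avg_quantiles = np.mean(summaries, axis=0)
    return float(np.interp(1.0 - keep_frac, GRID, avg_quantiles))
```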
arXiv Detail & Related papers (2024-03-07T14:28:04Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
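Ask-LLM-style scoring can be sketched by asking a judge model whether an example is worth training on and reading off the relative probability of "yes"; gpt2 below is only a stand-in judge, and the prompt wording is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in judge model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def ask_llm_score(example: str) -> float:
    # Use the judge's relative probability of "yes" as the sampling score.
    prompt = f"### Text:\n{example}\n\nIs this a useful training example? Answer yes or no:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]
    yes = tok(" yes", add_special_tokens=False).input_ids[0]
    no = tok(" no", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes, no]], dim=0)[0].item()
```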
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
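The similarity-search skeleton of LESS can be sketched with random projection plus cosine ranking; note that the actual method builds its gradient features from LoRA and optimizer state, which is omitted here.

```python
import numpy as np

def project(grads: np.ndarray, dim: int = 1024, seed: int = 0) -> np.ndarray:
    # Random projection keeps gradient geometry approximately intact while
    # making the search cheap; a fixed seed gives a consistent map.
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((grads.shape[1], dim)) / np.sqrt(dim)
    return grads @ P

def select_top_k(train_grads, target_grad, k):
    # Rank training examples by cosine similarity between their projected
    # gradient and the target task's projected gradient.
    t = project(train_grads)
    v = project(target_grad[None, :])[0]
    sims = (t @ v) / (np.linalg.norm(t, axis=1) * np.linalg.norm(v) + 1e-8)
    return np.argsort(-sims)[:k]
```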
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Data Diversity Matters for Robust Instruction Tuning [129.83575908023312]
Recent works have shown that curating high-quality and diverse instruction tuning datasets can significantly improve instruction-following capabilities.
We propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT) to control dataset diversity and quality.
We validate the performance of QDIT on several large scale instruction tuning datasets, where we find it can substantially improve worst and average case performance.
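A minimal sketch of one quality-diversity objective consistent with this description: greedy selection blending a quality score with a facility-location diversity gain, assuming a precomputed pairwise similarity matrix `sim` and a hand-set trade-off `lam`.

```python
import numpy as np

def qdit_select(quality, sim, k, lam=0.5):
    # Greedily pick k examples maximizing a blend of quality and a
    # facility-location gain (how much each candidate improves the best
    # similarity from every point to the selected set).
    n = len(quality)
    coverage = np.zeros(n)
    chosen = []
    for _ in range(k):
        gains = np.maximum(sim - coverage, 0).mean(axis=1)
        obj = lam * quality + (1 - lam) * gains
        obj[chosen] = -np.inf  # never pick the same example twice
        j = int(np.argmax(obj))
        chosen.append(j)
        coverage = np.maximum(coverage, sim[j])
    return chosen
```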
arXiv Detail & Related papers (2023-11-21T19:12:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.