FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
- URL: http://arxiv.org/abs/2502.00761v2
- Date: Tue, 18 Feb 2025 03:17:33 GMT
- Title: FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
- Authors: Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Jingang Wang, Xunliang Cai
- Abstract summary: We propose FIRE, a flexible framework for integrating multiple data quality raters. FIRE aligns multiple quality signals into a unified space and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Experiments on the SlimPajama dataset reveal that FIRE outperforms other data selection methods.
- Score: 13.182375437229519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting high-quality data can significantly improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques and single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Experiments on the SlimPajama dataset reveal that FIRE outperforms other data selection methods and significantly enhances the pretrained model across a wide range of downstream tasks, achieving a 2.9% average performance improvement over random selection and more than halving the FLOPs needed to reach a given performance level.
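As a concrete illustration of the recipe in the abstract, here is a minimal Python sketch; the rank-based alignment, uniform rater weighting, and fixed keep-fraction schedule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def align(scores: np.ndarray) -> np.ndarray:
    # Map one rater's raw scores to empirical quantiles in [0, 1] so that
    # raters with different scales and distributions become comparable.
    ranks = scores.argsort().argsort()
    return ranks / max(len(scores) - 1, 1)

def integrate(score_matrix: np.ndarray, weights=None) -> np.ndarray:
    # score_matrix has one row per rater, one column per data point.
    aligned = np.stack([align(row) for row in score_matrix])
    w = np.full(len(aligned), 1 / len(aligned)) if weights is None else np.asarray(weights)
    return w @ aligned  # unified quality signal per data point

def progressive_select(score_matrix: np.ndarray, keep_frac=0.5, rounds=3):
    # Iteratively re-integrate over the surviving pool and keep the top slice.
    idx = np.arange(score_matrix.shape[1])
    for _ in range(rounds):
        unified = integrate(score_matrix[:, idx])
        top = np.argsort(-unified)[: max(1, int(len(idx) * keep_frac))]
        idx = idx[top]
    return idx
```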
Related papers
- QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining [12.872792775510172]
We introduce a unified data selection framework called QuaDMix, which automatically optimizes the data distribution for large language model pretraining.
Our experiments across diverse models and datasets demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks.
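The summary does not spell out QuaDMix's optimization procedure; the hedged Python sketch below only illustrates the underlying quality-diversity trade-off with a hand-set interpolation weight `lam`, whereas QuaDMix tunes such mixture parameters automatically.

```python
import numpy as np

def sample_mixture(quality, domains, lam, k, rng):
    # Interpolate between pure quality-weighted sampling (lam=1) and
    # uniform-over-domain sampling (lam=0); quality scores assumed positive.
    uniq, counts = np.unique(domains, return_counts=True)
    inv = dict(zip(uniq, 1.0 / counts))
    balance = np.array([inv[d] for d in domains])
    w = lam * quality / quality.sum() + (1 - lam) * balance / balance.sum()
    return rng.choice(len(quality), size=k, replace=False, p=w / w.sum())

rng = np.random.default_rng(0)
quality = rng.random(1000) + 1e-6
domains = rng.integers(0, 5, size=1000)
picked = sample_mixture(quality, domains, lam=0.7, k=100, rng=rng)
```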
arXiv Detail & Related papers (2025-04-23T08:36:50Z) - Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
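A minimal sketch of what such a dual-model gate could look like; `quality_judge` and `difficulty_judge` are hypothetical callables returning scores in [0, 1], and the thresholds are assumptions rather than the paper's settings.

```python
def dual_model_filter(samples, quality_judge, difficulty_judge,
                      q_min=0.6, d_band=(0.3, 0.9)):
    # Keep an instruction sample only if one judge rates its quality highly
    # AND the other places its difficulty in a useful band (hard enough to
    # teach something, not so hard it is likely noise).
    kept = []
    for s in samples:
        if quality_judge(s) >= q_min and d_band[0] <= difficulty_judge(s) <= d_band[1]:
            kept.append(s)
    return kept
```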
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - Data Quality Control in Federated Instruction-tuning of Large Language Models [43.29678396558287]
Federated Learning enables privacy-preserving collaborative instruction tuning of large language models.
Local clients lack global visibility to filter noisy or low-quality samples before training.
We propose FedDQC, a novel federated instruction tuning framework with dynamic data quality control.
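The summary leaves FedDQC's scoring rule unspecified; as one purely local stand-in that needs no global visibility, a client can drop its own highest-loss samples each round, as sketched below.

```python
import numpy as np

def local_filter(per_sample_losses: np.ndarray, keep_frac: float = 0.8) -> np.ndarray:
    # Purely local heuristic: treat the highest-loss tail of this client's
    # own data as likely noise and drop it. Recomputing each round makes the
    # filter dynamic as the shared model improves.
    cutoff = np.quantile(per_sample_losses, keep_frac)
    return np.where(per_sample_losses <= cutoff)[0]  # indices to train on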
arXiv Detail & Related papers (2024-10-15T12:14:57Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
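The framework goes beyond raw similarity filtering, but its core building block can be sketched with the public openai/clip-vit-base-patch32 checkpoint; inputs are PIL images, and the cutoff would be an assumption.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pair_scores(images, captions):
    # Score each (image, caption) pair by CLIP similarity; low scores flag
    # mislabeled or noisy pairs that a selector could drop.
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.diagonal()  # matched-pair similarities

# keep = [i for i, s in enumerate(pair_scores(imgs, caps)) if s > cutoff]
```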
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Synth-Empathy: Towards High-Quality Synthetic Empathy Data [23.891966228508476]
Synth-Empathy is a pipeline that automatically generates high-quality empathetic data while discarding low-quality data.
We show the trade-off between data quantity and quality, providing insights into empathetic data generation and selection.
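A generic generate-then-filter loop in this spirit, with `generate` and `score` as hypothetical LLM-backed callables; raising the threshold trades quantity for quality, matching the trade-off noted above.

```python
def generate_and_filter(seed_dialogues, generate, score, threshold=0.7, rounds=2):
    # Expand the pool with synthetic responses, but admit only those the
    # quality scorer accepts.
    pool = list(seed_dialogues)
    for _ in range(rounds):
        candidates = [generate(d) for d in pool]  # hypothetical LLM call
        pool.extend(c for c in candidates if score(c) >= threshold)
    return pool
```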
arXiv Detail & Related papers (2024-07-31T15:12:24Z) - AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at comprehensively enhancing big data quality.
First, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
We also present a generic framework for detecting various quality anomalies using AI models.
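A weighted scoring system of this kind reduces to a normalized weighted sum; the dimension names and weights below are illustrative assumptions.

```python
def data_quality_score(metrics: dict, weights: dict) -> float:
    # Weighted aggregate over normalized per-dimension metrics in [0, 1].
    total = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total

# Example with assumed dimensions and weights:
score = data_quality_score(
    {"completeness": 0.95, "consistency": 0.80, "timeliness": 0.60},
    {"completeness": 0.5, "consistency": 0.3, "timeliness": 0.2},
)  # -> 0.835
```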
arXiv Detail & Related papers (2024-05-06T21:36:45Z) - Enhancing Data Quality in Federated Fine-Tuning of Foundation Models [54.757324343062734]
We propose a data quality control pipeline for federated fine-tuning of foundation models.
This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard.
Our experiments show that the proposed quality control pipeline improves the effectiveness and reliability of model training, leading to better performance.
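One privacy-friendly way to realize a global threshold (not necessarily the paper's aggregation) is for clients to report only score quantiles, which the server averages into an approximate global distribution.

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 101)  # probability levels for reported quantiles

def client_summary(local_scores):
    # Each client shares only quantiles of its quality scores, not raw data.
    return np.quantile(local_scores, GRID)

def global_threshold(summaries, keep_frac=0.5):
    # Average the clients' quantile functions to approximate the global score
    # distribution, then read off one shared cutoff that keeps the top slice.
    avg_quantiles = np.mean(summaries, axis=0)
    return float(np.interp(1.0 - keep_frac, GRID, avg_quantiles))
```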
arXiv Detail & Related papers (2024-03-07T14:28:04Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
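Ask-LLM-style scoring can be sketched by asking a judge model whether an example is worth training on and reading off the relative probability of "yes"; gpt2 below is only a stand-in judge, and the prompt wording is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in judge model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def ask_llm_score(example: str) -> float:
    # Use the judge's relative probability of "yes" as the sampling score.
    prompt = f"### Text:\n{example}\n\nIs this a useful training example? Answer yes or no:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]
    yes = tok(" yes", add_special_tokens=False).input_ids[0]
    no = tok(" no", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes, no]], dim=0)[0].item()
```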
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
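The similarity-search skeleton of LESS can be sketched with random projection plus cosine ranking; note that the actual method builds its gradient features from LoRA and optimizer state, which is omitted here.

```python
import numpy as np

def project(grads: np.ndarray, dim: int = 1024, seed: int = 0) -> np.ndarray:
    # Random projection keeps gradient geometry approximately intact while
    # making the search cheap; a fixed seed gives a consistent map.
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((grads.shape[1], dim)) / np.sqrt(dim)
    return grads @ P

def select_top_k(train_grads, target_grad, k):
    # Rank training examples by cosine similarity between their projected
    # gradient and the target task's projected gradient.
    t = project(train_grads)
    v = project(target_grad[None, :])[0]
    sims = (t @ v) / (np.linalg.norm(t, axis=1) * np.linalg.norm(v) + 1e-8)
    return np.argsort(-sims)[:k]
```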
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Data Diversity Matters for Robust Instruction Tuning [129.83575908023312]
Recent works have shown that curating high-quality and diverse instruction tuning datasets can significantly improve instruction-following capabilities.
We propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT) to control dataset diversity and quality.
We validate the performance of QDIT on several large scale instruction tuning datasets, where we find it can substantially improve worst and average case performance.
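A minimal sketch of one quality-diversity objective consistent with this description: greedy selection blending a quality score with a facility-location diversity gain, assuming a precomputed pairwise similarity matrix `sim` and a hand-set trade-off `lam`.

```python
import numpy as np

def qdit_select(quality, sim, k, lam=0.5):
    # Greedily pick k examples maximizing a blend of quality and a
    # facility-location gain (how much each candidate improves the best
    # similarity from every point to the selected set).
    n = len(quality)
    coverage = np.zeros(n)
    chosen = []
    for _ in range(k):
        gains = np.maximum(sim - coverage, 0).mean(axis=1)
        obj = lam * quality + (1 - lam) * gains
        obj[chosen] = -np.inf  # never pick the same example twice
        j = int(np.argmax(obj))
        chosen.append(j)
        coverage = np.maximum(coverage, sim[j])
    return chosen
```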
arXiv Detail & Related papers (2023-11-21T19:12:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.