Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
- URL: http://arxiv.org/abs/2504.14194v2
- Date: Thu, 01 May 2025 02:37:14 GMT
- Title: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
- Authors: Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
- Abstract summary: We propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens.
- Score: 7.61977883644433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens. Additionally, we release the annotated SlimPajama-627B dataset, labeled across 25 quality metrics (including PRRC), to advance research in data-centric LLM development. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.
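The selection mechanism the abstract describes (train proxy models under different quality-score weightings, fit a regression from weighting to validation loss, then pick the weighting with the lowest predicted loss) can be illustrated in a few lines. This is a minimal sketch assuming synthetic stand-in data, not the authors' implementation; the random proxy-run records below take the place of real (weighting, validation loss) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for logged proxy runs: each row is a weighting over quality
# dimensions (e.g., Professionalism, Readability, Reasoning, Cleanliness),
# paired with the validation loss of a small proxy model trained on data
# selected with that weighting.
n_runs, n_dims = 64, 4
weightings = rng.dirichlet(np.ones(n_dims), size=n_runs)            # rows sum to 1
val_loss = rng.normal(3.0, 0.05, n_runs) - 0.3 * weightings[:, 2]   # toy signal

# Fit a linear regression from weighting to validation loss
# (ordinary least squares with an intercept column).
X = np.hstack([weightings, np.ones((n_runs, 1))])
coef, *_ = np.linalg.lstsq(X, val_loss, rcond=None)

# Score a large pool of candidate weightings and keep the one whose
# predicted validation loss is lowest.
candidates = rng.dirichlet(np.ones(n_dims), size=10_000)
pred = np.hstack([candidates, np.ones((len(candidates), 1))]) @ coef
best = candidates[np.argmin(pred)]
print("selected weighting over quality dimensions:", np.round(best, 3))
```

In practice the regression would be fit on logged proxy-model runs, and the candidate search would be restricted to weightings the selection pipeline can actually realize.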
Related papers
- PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [28.442470930703337]
PRISM is a training-free approach for efficient multimodal data selection. It uses Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs. It reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods.
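As a toy illustration of the Pearson-correlation building block the summary mentions (this is not PRISM itself; the feature matrix is a random placeholder for real MLLM visual encodings):

```python
import numpy as np

# Random placeholder for per-sample visual features
# (rows: samples, columns: feature dimensions).
rng = np.random.default_rng(0)
features = rng.normal(size=(512, 16))

# Pearson correlation matrix across feature dimensions. np.corrcoef
# treats rows as variables, hence the transpose.
corr = np.corrcoef(features.T)
print(corr.shape)  # (16, 16)
```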
arXiv Detail & Related papers (2025-02-17T18:43:41Z)
- Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging [35.53877806259048]
This paper establishes the first comprehensive benchmark for model merging in large language models (LLMs). Our analysis reveals three pivotal insights: (i) previously overlooked collaborative/conflicting relationships among 3H dimensions, (ii) the consistent superiority of model merging over data mixture approaches in balancing alignment trade-offs, and (iii) the critical role of parameter-level conflict resolution through redundant component pruning and outlier mitigation. We propose R-TSVM, a Reweighting-enhanced Task Singular Vector Merging method that incorporates outlier-aware parameter weighting and sparsity-adaptive rank selection strategies adapted to the heavy-tailed parameter distributions.
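The "redundant component pruning" idea can be pictured with a toy delta-pruning merge in the spirit of magnitude-trimming merge methods such as TIES; this is an assumption-laden stand-in, not R-TSVM:

```python
import numpy as np

def merge_with_delta_pruning(base, experts, keep_frac=0.2):
    """Average expert deltas after zeroing all but the largest-magnitude
    entries of each delta: a toy stand-in for redundant-component pruning."""
    pruned = []
    for expert in experts:
        delta = expert - base
        k = max(1, int(keep_frac * delta.size))
        thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
        pruned.append(np.where(np.abs(delta) >= thresh, delta, 0.0))
    return base + np.mean(pruned, axis=0)

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))                 # shared base weights
experts = [base + rng.normal(scale=0.1, size=(4, 4)) for _ in range(3)]
merged = merge_with_delta_pruning(base, experts)
```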
arXiv Detail & Related papers (2025-02-08T11:56:58Z)
- Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences.
Recent work has shown that DPO's effectiveness relies on training data quality.
We discover that reference model probability space naturally detects high-quality training samples.
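One plausible reading of that observation is to score each preference pair by the reference model's log-probability margin between the chosen and rejected responses, keeping only the clearest pairs. The `ref_logp_*` arrays below are hypothetical placeholders for real reference-model scores:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical reference-model log-probabilities of each pair's
# chosen and rejected responses (placeholders for real scores).
ref_logp_chosen = rng.normal(-40.0, 5.0, 1000)
ref_logp_rejected = rng.normal(-43.0, 5.0, 1000)

# Pairs where the reference model already clearly prefers the chosen
# response carry the least ambiguous preference signal.
margin = ref_logp_chosen - ref_logp_rejected
keep = np.argsort(margin)[-int(0.3 * len(margin)):]  # top 30% by margin
print(f"kept {len(keep)} of {len(margin)} preference pairs")
```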
arXiv Detail & Related papers (2025-01-25T07:21:50Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
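A minimal sketch of that dual-model gate, with `score_difficulty` and `score_quality` as hypothetical stand-ins for the two judge models:

```python
def score_difficulty(sample: str) -> float:
    # Hypothetical judge-model call; here, a crude length-based proxy.
    return min(len(sample.split()) / 20.0, 1.0)

def score_quality(sample: str) -> float:
    # Hypothetical second judge; here, penalize very short samples.
    return 0.0 if len(sample) < 20 else 0.8

def keep(sample: str, d_min: float = 0.3, q_min: float = 0.5) -> bool:
    # A sample survives only if both judges clear their thresholds.
    return score_difficulty(sample) >= d_min and score_quality(sample) >= q_min

data = ["Explain why the sky is blue in terms of Rayleigh scattering.", "Hi."]
print([s for s in data if keep(s)])  # only the first sample survives
```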
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Scaling Parameter-Constrained Language Models with Quality Data [32.35610029333478]
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters.
We extend the conventional understanding of scaling laws by offering a microscopic view of data quality within the original formulation.
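One common way to fold a quality term into the Chinchilla-style formulation is to let quality rescale the effective token count; the following is illustrative notation, not the paper's exact parameterization:

```latex
% Illustrative: a quality factor q in (0, 1] rescales the effective data term
% of a Chinchilla-style loss in model size N and token count D.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{(q \cdot D)^{\beta}}
```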
arXiv Detail & Related papers (2024-10-04T02:07:17Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
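Ask-LLM scores each candidate document by asking a proxy LLM whether it would make good training data and keeping the top-scoring fraction. The sketch below assumes a hypothetical `yes_probability` stand-in for the model call (a real implementation would read the model's probability of the "yes" token):

```python
PROMPT = (
    "Does the following text contain informative, well-written content that "
    "would make good language-model training data? Answer yes or no.\n\n{doc}"
)

def yes_probability(doc: str) -> float:
    """Hypothetical stand-in for an LLM call returning P('yes' | PROMPT).
    A toy lexical-variety proxy keeps the sketch self-contained."""
    return min(len(set(doc.split())) / 40.0, 1.0)

docs = [
    "click here click here click here",
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
]
ranked = sorted(docs, key=yes_probability, reverse=True)
kept = ranked[: max(1, len(ranked) // 2)]  # keep the top-scoring fraction
print(kept)
```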
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15 percentage points.
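The linear-programming component can be pictured with a generic LP relaxation that assigns evaluation examples to discovered skill categories under capacity constraints; this illustrates the building block with scipy, not QualEval's actual formulation, and the affinity matrix is a placeholder:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_examples, n_skills = 6, 2
affinity = rng.random((n_examples, n_skills))  # hypothetical example-skill affinities

# Maximize total affinity subject to: each example assigned exactly once,
# each skill receiving at most `cap` examples (an LP relaxation of assignment).
c = -affinity.ravel()                              # linprog minimizes
A_eq = np.zeros((n_examples, n_examples * n_skills))
for i in range(n_examples):
    A_eq[i, i * n_skills:(i + 1) * n_skills] = 1   # each example sums to 1
b_eq = np.ones(n_examples)
cap = 4
A_ub = np.zeros((n_skills, n_examples * n_skills))
for j in range(n_skills):
    A_ub[j, j::n_skills] = 1                       # capacity per skill
b_ub = np.full(n_skills, cap)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
assignment = res.x.reshape(n_examples, n_skills).argmax(axis=1)
print("example -> skill:", assignment)
```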
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- SelectFormer: Private and Practical Data Selection for Transformers [17.828547661524688]
This paper makes it practical to perform data selection over Multi-Party Computation (MPC). Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while seeing only around 0.20% accuracy degradation from training with the selected data.
arXiv Detail & Related papers (2023-10-03T18:52:57Z)
- Prototypical Fine-tuning: Towards Robust Performance Under Varying Data Sizes [47.880781811936345]
We propose a novel framework for fine-tuning pretrained language models (LMs).
Our prototypical fine-tuning approach can automatically adjust the model capacity according to the number of data points and the model's inherent attributes.
arXiv Detail & Related papers (2022-11-24T14:38:08Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
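The core slice-detection step, flagging groups whose accuracy falls well below the overall average, can be sketched generically (this is not Edisa; the grouping feature and records are placeholders):

```python
from collections import defaultdict

# Placeholder records: (slice feature value, prediction correct?).
records = [("short", True), ("short", True), ("long", False),
           ("long", False), ("long", True), ("short", True)]

overall = sum(correct for _, correct in records) / len(records)
by_slice = defaultdict(list)
for feat, correct in records:
    by_slice[feat].append(correct)

# Flag slices underperforming the overall accuracy by a margin.
margin = 0.15
for feat, outcomes in by_slice.items():
    acc = sum(outcomes) / len(outcomes)
    if acc < overall - margin:
        print(f"underperforming slice: {feat} (acc {acc:.2f} vs overall {overall:.2f})")
```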
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
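The precision side of such k-NN-based metrics can be sketched as follows: a generated sample counts as precise if it lands inside the k-nearest-neighbor ball of some real sample. This follows the common improved precision-recall construction, not necessarily the paper's exact metric; the Gaussian data is a placeholder:

```python
import numpy as np

def knn_radii(points, k=3):
    # Distance from each point to its k-th nearest neighbor (column 0 of the
    # sorted distances is the zero self-distance, so index k is the k-th NN).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]

def precision(real, fake, k=3):
    # Fraction of generated samples inside some real sample's k-NN ball.
    radii = knn_radii(real, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))
fake = rng.normal(loc=0.2, size=(200, 8))
print(f"precision ~ {precision(real, fake):.2f}")  # recall: swap real and fake
```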
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.