CritiQ: Mining Data Quality Criteria from Human Preferences
- URL: http://arxiv.org/abs/2502.19279v1
- Date: Wed, 26 Feb 2025 16:33:41 GMT
- Title: CritiQ: Mining Data Quality Criteria from Human Preferences
- Authors: Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
- Abstract summary: We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality. CritiQ Flow employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We demonstrate the effectiveness of our method in the code, math, and logic domains.
- Score: 70.35346554179036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models depend heavily on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only $\sim$30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and have reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
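The abstract describes the manager/worker loop only at a high level, so here is a minimal Python sketch of what such a criteria-evolution loop could look like. This is not the authors' implementation: the `llm` callable (any text-completion client), the prompts, and the helper names (`judge_pair`, `agreement`, `critiq_flow`) are illustrative assumptions, and the knowledge-base lookup is omitted. The pairwise-to-scalar scorer training is sketched separately after the related-papers list below.

```python
def judge_pair(llm, criterion, text_a, text_b):
    """Worker agent: ask the LLM which text better satisfies one criterion."""
    prompt = (
        f"Criterion: {criterion}\n\n"
        f"Text A:\n{text_a}\n\nText B:\n{text_b}\n\n"
        "Which text better satisfies the criterion? Answer 'A' or 'B'."
    )
    return llm(prompt).strip().upper().startswith("A")


def agreement(llm, criteria, annotated_pairs):
    """Fraction of human-annotated (chosen, rejected) pairs where a majority
    vote over the criteria agrees with the human preference."""
    correct = 0
    for chosen, rejected in annotated_pairs:
        votes = sum(judge_pair(llm, c, chosen, rejected) for c in criteria)
        correct += votes > len(criteria) / 2
    return correct / len(annotated_pairs)


def critiq_flow(llm, seed_criteria, annotated_pairs, n_rounds=5):
    """Manager agent: iteratively rewrite the criteria to better match the
    small set of human-annotated pairs, keeping the best set seen so far."""
    criteria = list(seed_criteria)
    best, best_acc = list(seed_criteria), 0.0
    for _ in range(n_rounds):
        acc = agreement(llm, criteria, annotated_pairs)
        if acc >= best_acc:
            best, best_acc = list(criteria), acc
        # Reflection step: the manager revises low-agreement criteria.
        prompt = (
            "You maintain data-quality criteria for pre-training corpora.\n"
            "Current criteria:\n"
            + "\n".join(f"- {c}" for c in criteria)
            + f"\nAgreement with human preferences: {acc:.2f}.\n"
            "Propose an improved list, one criterion per line."
        )
        criteria = [line.strip("- ").strip()
                    for line in llm(prompt).splitlines() if line.strip()]
    return best, best_acc
```

With roughly 30 annotated pairs, a few rounds of this propose-judge-reflect loop are enough to compare candidate criteria sets; majority voting over worker judgments is what makes a single noisy LLM judgment tolerable.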
Related papers
- Maximizing Signal in Human-Model Preference Alignment [0.0]
This paper argues that in cases in which end users need to agree with the decisions made by ML models, models should be trained and evaluated on data that represent their preferences.
We show that noise in labeling disagreement can be minimized by adhering to proven methodological best practices.
arXiv Detail & Related papers (2025-03-06T19:10:57Z) - DataMan: Data Manager for Pre-training Large Language Models [39.677609311769146]
Existing methods rely on limited intuition, lacking comprehensive and clear guidelines. We derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model.
arXiv Detail & Related papers (2025-02-26T18:01:19Z) - How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop a suite of selectors to get the most informative datapoints for human evaluation.
We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection.
In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
arXiv Detail & Related papers (2025-01-30T10:33:26Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria (a generic sketch of this pairwise-to-scalar training setup appears after this list).
arXiv Detail & Related papers (2024-02-15T06:36:07Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)