Balancing Label Quantity and Quality for Scalable Elicitation
- URL: http://arxiv.org/abs/2410.13215v2
- Date: Mon, 21 Oct 2024 01:32:37 GMT
- Title: Balancing Label Quantity and Quality for Scalable Elicitation
- Authors: Alex Mallen, Nora Belrose
- Abstract summary: We study the microeconomics of the quantity-quality tradeoff on binary NLP classification tasks.
We observe three regimes of eliciting classification knowledge from pretrained models using supervised finetuning.
We find that the accuracy of supervised fine-tuning can be improved by up to 5 percentage points at a fixed labeling budget.
- Score: 2.2143065226946423
- Abstract: Scalable oversight studies methods of training and evaluating AI systems in domains where human judgment is unreliable or expensive, such as scientific research and software engineering in complex codebases. Most work in this area has focused on methods of improving the quality of labels. Recent work by Burns et al. (2023) considers the complementary problem of training models with low-quality labels, finding that large pretrained models often have an inductive bias towards producing correct answers. In practice, however, neither label quantity nor quality is fixed: practitioners face a quantity-quality tradeoff. In this paper, we explore the microeconomics of the quantity-quality tradeoff on binary NLP classification tasks used in Burns et al. (2023). While sample-efficient learning has been studied extensively, little public research has focused on scalable elicitation: eliciting capabilities from pretrained models subject to labeling cost constraints. We find that this setting has novel dynamics caused by the tradeoff between label quantity and quality, as well as the model's existing latent capabilities. We observe three regimes of eliciting classification knowledge from pretrained models using supervised finetuning: quantity-dominant, quality-dominant, and a mixed regime involving the use of low- and high-quality data together to attain higher accuracy at a lower cost than using either alone. We explore sample-efficient elicitation methods that make use of two datasets of differing qualities, and establish a Pareto frontier of scalable elicitation methods that optimally trade off labeling cost and classifier performance. We find that the accuracy of supervised fine-tuning can be improved by up to 5 percentage points at a fixed labeling budget by adding a few-shot prompt to make use of the model's existing knowledge of the task.
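The Pareto frontier mentioned in the abstract can be illustrated with a minimal sketch. It assumes each candidate elicitation method is summarized by a (labeling cost, accuracy) pair; the function name `pareto_frontier` and all numbers below are hypothetical placeholders for illustration, not results or code from the paper.

```python
from typing import List, Tuple

# (labeling_cost, accuracy); hypothetical summary of one elicitation method
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats on both cost and accuracy.

    A point is dropped if some other point costs no more and is at least
    as accurate (and differs in at least one of the two).
    """
    # Sort by cost ascending; among equal costs, put the most accurate first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    for cost, accuracy in ordered:
        # A point survives only if it is strictly more accurate than
        # every cheaper (or equally cheap) point seen so far.
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Made-up (cost, accuracy) pairs, for illustration only.
candidates = [
    (100, 0.70),
    (100, 0.66),
    (250, 0.74),
    (400, 0.73),
    (600, 0.80),
    (900, 0.80),
]

print(pareto_frontier(candidates))
```

Under these assumptions, a method stays on the frontier only if reaching its accuracy would cost more with every alternative, which is one way to read "optimally trade off labeling cost and classifier performance".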
Related papers
- "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations [0.0]
"Gold" and "ground truth" human-mediated labels have error.
This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans.
arXiv Detail & Related papers (2024-11-23T19:18:08Z)
- Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning [81.83013974171364]
Semi-supervised multi-label learning (SSMLL) is a powerful framework for leveraging unlabeled data to reduce the expensive cost of collecting precise multi-label annotations.
Unlike semi-supervised learning, one cannot select the most probable label as the pseudo-label in SSMLL due to multiple semantics contained in an instance.
We propose a dual-perspective method to generate high-quality pseudo-labels.
arXiv Detail & Related papers (2024-07-26T09:33:53Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
- Going Beyond One-Hot Encoding in Classification: Can Human Uncertainty Improve Model Performance? [14.610038284393166]
We show that label uncertainty is explicitly embedded into the training process via distributional labels.
The incorporation of label uncertainty helps the model to generalize better to unseen data and increases model performance.
Similar to existing calibration methods, the distributional labels lead to better-calibrated probabilities, which in turn yield more certain and trustworthy predictions.
arXiv Detail & Related papers (2022-05-30T17:19:11Z)
- Quantity vs Quality: Investigating the Trade-Off between Sample Size and Label Reliability [0.0]
We study learning in probabilistic domains where the learner may receive incorrect labels but can improve the reliability of labels by repeatedly sampling them.
We motivate this problem in an application to compare the strength of poker hands where the training signal depends on the hidden community cards.
We propose two different validation strategies: switching from lower to higher validations over the course of training, and using chi-square statistics to approximate the confidence in obtained labels.
arXiv Detail & Related papers (2022-04-20T13:52:00Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- Label, Verify, Correct: A Simple Few Shot Object Detection Method [93.84801062680786]
We introduce a simple pseudo-labelling method to source high-quality pseudo-annotations from a training set.
We present two novel methods to improve the precision of the pseudo-labelling process.
Our method achieves state-of-the-art or second-best performance compared to existing approaches.
arXiv Detail & Related papers (2021-12-10T18:59:06Z)
- Improving Medical Annotation Quality to Decrease Labeling Burden Using Stratified Noisy Cross-Validation [3.690031561736533]
Variability in diagnosis of medical images is well established; variability in training and attention to task among medical labelers may exacerbate this issue.
Noisy Cross-Validation splits the training data into halves, and has been shown to identify low-quality labels in computer vision tasks.
In this work we introduce Stratified Noisy Cross-Validation (SNCV), an extension of noisy cross validation.
arXiv Detail & Related papers (2020-09-22T23:32:59Z)
- Mitigating Class Boundary Label Uncertainty to Reduce Both Model Bias and Variance [4.563176550691304]
We investigate a new approach to handle inaccuracy and uncertainty in the training data labels.
Our method can reduce both bias and variance by estimating the pointwise label uncertainty of the training set.
arXiv Detail & Related papers (2020-02-23T18:24:04Z)