On Efficient and Statistical Quality Estimation for Data Annotation
- URL: http://arxiv.org/abs/2405.11919v2
- Date: Wed, 29 May 2024 06:43:37 GMT
- Title: On Efficient and Statistical Quality Estimation for Data Annotation
- Authors: Jan-Christoph Klie, Juan Haladjian, Marc Kirchner, Rahul Nair,
- Abstract summary: Annotated datasets are an essential ingredient to train, evaluate, compare and productionalize supervised machine learning models.
Quality estimation is often performed by having experts manually label instances as correct or incorrect.
Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate.
We show that acceptance sampling can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.
- Score: 11.216738303463751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Annotated datasets are an essential ingredient to train, evaluate, compare and productionalize supervised machine learning models. It is therefore imperative that annotations are of high quality. For their creation, good quality management and thereby reliable quality estimates are needed. Then, if quality is insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect. But checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; sizes are chosen mostly without justification or regard to statistical power and more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate. Using unnecessarily large sample sizes costs money that could be better spent, for instance on more annotations. Therefore, we first describe in detail how to use confidence intervals for finding the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation. We show that acceptance sampling can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.
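The abstract describes two ideas that a short sketch can make concrete: (1) choosing the minimal sample size so that a confidence interval for the annotation error rate has a desired width, and (2) acceptance sampling, where a batch of annotations is accepted if a sample of size n contains at most c errors. The Python sketch below is illustrative only, assuming a normal-approximation interval and a standard single sampling plan; the function names and parameters (margin, aql, rql, alpha, beta) are assumptions of this sketch, not the paper's notation or implementation.

```python
# Illustrative sketch (not the authors' code): minimal sample size for estimating an
# annotation error rate via a confidence interval, and a generic single acceptance-
# sampling plan. All parameter names are assumptions for illustration.
import math
from scipy.stats import norm, binom


def min_sample_size(margin: float, alpha: float = 0.05, p_guess: float = 0.5) -> int:
    """Smallest n so that a (1 - alpha) normal-approximation CI for the error
    rate has half-width <= margin (worst case at p_guess = 0.5)."""
    z = norm.ppf(1 - alpha / 2)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


def single_sampling_plan(aql: float, rql: float, alpha: float = 0.05, beta: float = 0.10):
    """Smallest (n, c) single sampling plan: accept the annotation batch if a
    sample of size n contains at most c errors, such that batches at the
    acceptable quality level (aql) are accepted with probability >= 1 - alpha
    and batches at the rejectable quality level (rql) with probability <= beta."""
    for n in range(1, 10_000):
        for c in range(0, n + 1):
            accept_at_aql = binom.cdf(c, n, aql)  # P(accept | error rate = aql)
            accept_at_rql = binom.cdf(c, n, rql)  # P(accept | error rate = rql)
            if accept_at_rql > beta:
                break  # cdf grows with c, so no larger c can satisfy the consumer's risk
            if accept_at_aql >= 1 - alpha:
                return n, c
    raise ValueError("no plan found within the search range")


if __name__ == "__main__":
    # Roughly 385 inspected instances for a 95% CI with +/-5% half-width.
    print(min_sample_size(margin=0.05))
    # Example plan distinguishing a 5% from a 15% error rate.
    print(single_sampling_plan(aql=0.05, rql=0.15))
```

The point of the comparison in the paper is that a plan of this kind, which only decides accept/reject rather than estimating the error rate precisely, can get away with a smaller n for the same risk levels.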
Related papers
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.
We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z) - Training Normalizing Flows with the Precision-Recall Divergence [73.92251251511199]
We show that achieving a specified precision-recall trade-off corresponds to minimising f-divergences from a family we call the PR-divergences.
We propose a novel generative model that is able to train a normalizing flow to minimise any f-divergence, and in particular, achieve a given precision-recall trade-off.
arXiv Detail & Related papers (2023-02-01T17:46:47Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular.
We show that the statistical problems with covariance estimation drive the poor performance of H-score.
We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
arXiv Detail & Related papers (2021-10-13T17:24:12Z) - Low-Shot Validation: Active Importance Sampling for Estimating Classifier Performance on Rare Categories [47.050853657721596]
For machine learning models trained with limited labeled training data, validation stands to become the main bottleneck to reducing overall annotation costs.
We propose a statistical validation algorithm that accurately estimates the F-score of binary classifiers for rare categories.
In particular, we can estimate model F1 scores with a variance of 0.005 using as few as 100 labels.
arXiv Detail & Related papers (2021-09-13T06:01:16Z) - Identifying Wrongly Predicted Samples: A Method for Active Learning [6.976600214375139]
We propose a simple sample selection criterion that moves beyond uncertainty.
We show state-of-the-art results and better rates at identifying wrongly predicted samples.
arXiv Detail & Related papers (2020-10-14T09:00:42Z) - Predicting the Accuracy of a Few-Shot Classifier [3.609538870261841]
We first analyze the reasons for the variability of generalization performances.
We propose reasonable measures that we empirically demonstrate to be correlated with the generalization ability of considered classifiers.
arXiv Detail & Related papers (2020-07-08T16:31:28Z) - Minority Class Oversampling for Tabular Data with Deep Generative Models [4.976007156860967]
We study the ability of deep generative models to provide realistic samples that improve performance on imbalanced classification tasks via oversampling.
Our experiments show that the sampling method does not affect quality, but runtime varies widely.
We also observe that the improvements in terms of performance metrics, while statistically significant, are often minor in absolute terms.
arXiv Detail & Related papers (2020-05-07T21:35:57Z) - TraDE: Transformers for Density Estimation [101.20137732920718]
TraDE is a self-attention-based architecture for auto-regressive density estimation.
We present a suite of tasks such as regression using generated samples, out-of-distribution detection, and robustness to noise in the training data.
arXiv Detail & Related papers (2020-04-06T07:32:51Z) - Information-Theoretic Probing with Minimum Description Length [74.29846942213445]
We propose an alternative to the standard probes, information-theoretic probing with minimum description length (MDL).
With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data.
We show that these methods agree in results and are more informative and stable than the standard probes.
arXiv Detail & Related papers (2020-03-27T09:35:38Z)