Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics
for Data Selection
- URL: http://arxiv.org/abs/2311.16302v1
- Date: Mon, 27 Nov 2023 20:33:54 GMT
- Title: Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics
for Data Selection
- Authors: Anusha Sabbineni and Nikhil Anand and Maria Minakova
- Abstract summary: We present ways of assessing prospective training examples for their "usefulness" or "difficulty".
We primarily experiment with entropy and Error L2-Norm (EL2N) scores.
We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can result in a 2% decrease in semantic error rate and 4%-7% decrease in domain classification error rate when compared to the baseline technique of random selection.
- Score: 0.8192907805418583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While data selection methods have been studied extensively in active
learning, data pruning, and data augmentation settings, there is little
evidence for the efficacy of these methods in industry scale settings,
particularly in low-resource languages. Our work presents ways of assessing
prospective training examples in those settings for their "usefulness" or
"difficulty". We also demonstrate how these measures can be used in selecting
important examples for training supervised machine learning models. We
primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these
metrics to curate high-quality datasets from a large pool of "Weak Signal
Labeled" data, in which high-confidence, no-defect hypotheses produced during
inference are treated as ground-truth labels. We then conduct training data augmentation
experiments using these de-identified datasets and demonstrate that score-based
selection can result in a 2% decrease in semantic error rate and 4%-7% decrease
in domain classification error rate when compared to the baseline technique of
random selection.
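Both scores named in the abstract are standard enough to state compactly. The paper ships no reference code, so the NumPy sketch below is purely illustrative; the function names and the final top-k policy are assumptions, not the authors' implementation:

```python
import numpy as np

def entropy_score(probs: np.ndarray) -> np.ndarray:
    # Predictive entropy of each softmax row; higher entropy marks a more
    # uncertain, hence "harder", candidate example.
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def el2n_score(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # Error L2-Norm (EL2N): L2 distance between the softmax output and the
    # one-hot label vector, as defined in the data-pruning literature.
    one_hot = np.eye(probs.shape[-1])[labels]
    return np.linalg.norm(probs - one_hot, axis=-1)

# Example: score a tiny pool and keep the k hardest candidates.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25]])
labels = np.array([0, 1])
hardest = np.argsort(el2n_score(probs, labels))[::-1][:1]
```

Whether high scores mark examples to keep (informative) or to drop (likely noise) is a policy choice the abstract leaves open; the top-k line above is just one option.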
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrades training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Improve Cost Efficiency of Active Learning over Noisy Dataset [1.3846014191157405]
In this paper, we consider binary classification cases where acquiring a positive instance incurs a significantly higher cost than acquiring a negative one.
We propose a shifted normal distribution sampling function that samples from a wider range than typical uncertainty sampling.
Our simulation shows that the proposed sampling function limits both noisy and positive label selection, delivering between 20% and 32% improved cost efficiency across different test datasets; a rough sketch of the sampling idea follows this entry.
arXiv Detail & Related papers (2024-03-02T23:53:24Z)
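The entry above gives no formula for the shifted normal sampling function, so the sketch below is one plausible reading: weight candidates by a Gaussian centered slightly below the maximum uncertainty, then sample a batch in proportion to those weights. The `shift` and `width` parameters are invented for illustration:

```python
import numpy as np

def shifted_normal_weights(uncertainty, shift=0.1, width=0.15):
    # Center a Gaussian slightly below the maximum uncertainty so sampling
    # covers a wider band of scores than plain top-uncertainty selection;
    # `shift` and `width` are illustrative, not taken from the paper.
    center = uncertainty.max() - shift
    return np.exp(-0.5 * ((uncertainty - center) / width) ** 2)

def sample_batch(uncertainty, k, seed=0):
    # Draw k candidates without replacement, proportionally to the weights.
    rng = np.random.default_rng(seed)
    w = shifted_normal_weights(uncertainty)
    return rng.choice(len(uncertainty), size=k, replace=False, p=w / w.sum())
```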
- Overcoming Overconfidence for Active Learning [1.2776312584227847]
We present two novel methods to address the overconfidence that arises in active learning.
The first is an augmentation strategy named Cross-Mix-and-Mix (CMaM), which aims to calibrate the model by expanding the limited training distribution.
The second is a selection strategy named Ranked Margin Sampling (RankedMS), which prevents choosing data that leads to overly confident predictions; a generic margin-scoring sketch follows this entry.
arXiv Detail & Related papers (2023-08-21T09:04:54Z)
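RankedMS itself is not specified in the summary above; the sketch below shows the generic margin score such methods build on, with low-margin selection standing in for the paper's ranking step:

```python
import numpy as np

def margin_score(probs: np.ndarray) -> np.ndarray:
    # Margin = top-1 probability minus top-2 probability: near 0 means the
    # model is torn between classes, near 1 means a very confident call.
    ordered = np.sort(probs, axis=-1)
    return ordered[..., -1] - ordered[..., -2]

def select_low_margin(probs: np.ndarray, k: int) -> np.ndarray:
    # Rank candidates by ascending margin and keep the k least confident,
    # steering acquisition away from overly confident predictions.
    return np.argsort(margin_score(probs))[:k]
```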
- Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism [4.675583319625962]
Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models.
It can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others.
We propose a novel approach that estimates the missing-data mechanism and uses inverse propensity weighting to debias any SSL algorithm; a schematic sketch of that weighting follows this entry.
arXiv Detail & Related papers (2023-02-15T09:18:46Z)
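A schematic of the inverse propensity weighting recipe described above, with a logistic-regression propensity model standing in for whatever estimator the paper actually uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_weights(features: np.ndarray, is_labeled: np.ndarray) -> np.ndarray:
    # Fit a propensity model e(x) = P(labeled | x), then weight each
    # labeled example by 1 / e(x) so the labeled subsample behaves like a
    # draw from the full data distribution.
    propensity = (LogisticRegression()
                  .fit(features, is_labeled)
                  .predict_proba(features)[:, 1])
    weights = np.zeros(len(features))
    mask = is_labeled.astype(bool)
    # Clip to avoid exploding weights for rarely-labeled regions.
    weights[mask] = 1.0 / np.clip(propensity[mask], 1e-3, None)
    return weights
```

The resulting weights can multiply any per-example loss, e.g. `(weights * per_example_loss)[is_labeled.astype(bool)].mean()`.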
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification; a schematic confidence-weighting sketch follows this entry.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
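A minimal sketch of the soft confidence weighting behind the quantity-quality trade-off: the SoftMatch paper weights pseudo-labels with a truncated Gaussian whose mean and variance are tracked by an EMA; batch statistics are substituted here to keep the sketch self-contained:

```python
import numpy as np

def soft_weights(confidence: np.ndarray) -> np.ndarray:
    # Weight each pseudo-label by a truncated Gaussian in its confidence:
    # predictions at or above the mean get full weight, lower-confidence
    # ones are smoothly down-weighted rather than discarded outright.
    mu, var = confidence.mean(), confidence.var() + 1e-8
    gauss = np.exp(-((confidence - mu) ** 2) / (2.0 * var))
    return np.where(confidence >= mu, 1.0, gauss)
```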
- ALLSH: Active Learning Guided by Local Sensitivity and Hardness [98.61023158378407]
We propose to retrieve unlabeled samples with a local sensitivity and hardness-aware acquisition function.
Our method achieves consistent gains over the commonly used active learning strategies in various classification tasks.
arXiv Detail & Related papers (2022-05-10T15:39:11Z)
- Assessing the Quality of the Datasets by Identifying Mislabeled Samples [14.881597737762316]
We propose a novel statistic -- noise score -- as a measure for the quality of each data point to identify mislabeled samples.
In our work, we use the representations derived by the inference network of a data-quality-supervised variational autoencoder (AQUAVS).
We validate our proposed statistic through experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets.
arXiv Detail & Related papers (2021-09-10T17:14:09Z)
- Towards Reducing Labeling Cost in Deep Object Detection [61.010693873330446]
We propose a unified framework for active learning that considers both the uncertainty and the robustness of the detector.
Our method is able to pseudo-label the very confident predictions, suppressing a potential distribution drift.
arXiv Detail & Related papers (2021-06-22T16:53:09Z)
- Robust Fairness-aware Learning Under Sample Selection Bias [17.09665420515772]
We propose a framework for robust and fair learning under sample selection bias.
We develop two algorithms that handle sample selection bias both when test data is available and when it is unavailable.
arXiv Detail & Related papers (2021-05-24T23:23:36Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method using language models trained on linearized labeled sentences; a toy linearization example follows this entry.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
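A toy example of the linearization DAGA trains its language model on: tag tokens are interleaved with words so that labeled sentences become ordinary token sequences the model can learn and sample from. The exact interleaving scheme here only approximates the paper's:

```python
def linearize(tokens, tags):
    # Insert each non-O tag token before the word it labels, turning a
    # tagged sentence into a flat sequence for ordinary LM training.
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(tok)
    return " ".join(out)

print(linearize(["John", "lives", "in", "Paris"],
                ["B-PER", "O", "O", "B-LOC"]))
# -> B-PER John lives in B-LOC Paris
```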
This list is automatically generated from the titles and abstracts of the papers on this site.