Enhanced Nearest Neighbor Classification for Crowdsourcing
- URL: http://arxiv.org/abs/2203.00781v1
- Date: Sat, 26 Feb 2022 22:53:52 GMT
- Title: Enhanced Nearest Neighbor Classification for Crowdsourcing
- Authors: Jiexin Duan, Xingye Qiao, Guang Cheng
- Abstract summary: Crowdsourcing is an economical way to label a large amount of data.
The noise in the produced labels may deteriorate the accuracy of any classification method applied to the labelled data.
We propose an enhanced nearest neighbor classifier (ENN) to overcome this issue.
- Score: 26.19048869302787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In machine learning, crowdsourcing is an economical way to label a large
amount of data. However, the noise in the produced labels may deteriorate the
accuracy of any classification method applied to the labelled data. We propose
an enhanced nearest neighbor classifier (ENN) to overcome this issue. Two
algorithms are developed to estimate the worker quality (which is often unknown
in practice): one is to construct the estimate based on the denoised worker
labels by applying the $k$NN classifier to the expert data; the other is an
iterative algorithm that works even without access to the expert data. Beyond
strong numerical evidence, our proposed methods are proven to achieve the
same regret as the oracle version based on high-quality expert data. As a
technical by-product, a lower bound on the sample size assigned to each worker
to reach the optimal convergence rate of regret is derived.
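The first quality-estimation idea in the abstract (scoring a worker against a $k$NN classifier fitted on expert data) can be illustrated with a minimal sketch. This is not the authors' ENN algorithm: the agreement-rate estimator, the function names `knn_predict` and `estimate_worker_quality`, the synthetic data, and the choice `k=5` are all assumptions made only for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote kNN for binary labels in {0, 1} (use odd k to avoid ties)."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def estimate_worker_quality(X_expert, y_expert, X_worker, y_worker, k):
    """Hypothetical estimator: share of a worker's labels that agree with
    the kNN prediction obtained from the expert data."""
    reference = knn_predict(X_expert, y_expert, X_worker, k)
    return float((reference == y_worker).mean())

rng = np.random.default_rng(0)

def make_data(n, flip_rate):
    """Synthetic points; the true label is 1 when the first feature is positive.
    Each label is flipped independently with probability flip_rate."""
    X = rng.normal(size=(n, 2))
    y_true = (X[:, 0] > 0).astype(int)
    flips = rng.random(n) < flip_rate
    return X, np.where(flips, 1 - y_true, y_true)

X_expert, y_expert = make_data(300, flip_rate=0.0)  # clean expert labels
X_good, y_good = make_data(500, flip_rate=0.1)      # fairly reliable worker
X_bad, y_bad = make_data(500, flip_rate=0.4)        # noisy worker

q_good = estimate_worker_quality(X_expert, y_expert, X_good, y_good, k=5)
q_bad = estimate_worker_quality(X_expert, y_expert, X_bad, y_bad, k=5)
print(f"estimated quality: good worker {q_good:.2f}, noisy worker {q_bad:.2f}")
```

In this toy setting the noisier worker should receive the lower score, and such scores could then be used to weight or filter worker labels before classification. How ENN actually combines them is described in the paper, not here.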
Related papers
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Unsupervised Crowdsourcing with Accuracy and Cost Guarantees [4.008789789191313]
We consider the problem of cost-optimal utilization of a crowdsourcing platform for binary, unsupervised classification of a collection of items.
We propose algorithms for acquiring label predictions from workers, and for inferring the true labels of items.
arXiv Detail & Related papers (2022-07-05T12:14:11Z)
- Active learning for reducing labeling effort in text classification tasks [3.8424737607413153]
Active learning (AL) is a paradigm that aims to reduce labeling effort by using only the data that the model deems most informative.
We present an empirical study that compares different uncertainty-based algorithms with BERT$_{base}$ as the classifier.
Our results show that using uncertainty-based AL with BERT$_{base}$ outperforms random sampling of data.
arXiv Detail & Related papers (2021-09-10T13:00:36Z)
- Robust Long-Tailed Learning under Label Noise [50.00837134041317]
This work investigates the label noise problem under long-tailed label distribution.
We propose a robust framework that realizes noise detection for long-tailed learning.
Our framework can naturally leverage semi-supervised learning algorithms to further improve the generalisation.
arXiv Detail & Related papers (2021-08-26T03:45:00Z)
- Confident in the Crowd: Bayesian Inference to Improve Data Labelling in Crowdsourcing [0.30458514384586394]
We present new techniques to improve the quality of the labels while attempting to reduce the cost.
This paper investigates the use of more sophisticated methods, such as Bayesian inference, to measure the performance of the labellers.
Our methods outperform the standard voting methods in both cost and accuracy while maintaining higher reliability when there is disagreement within the crowd.
arXiv Detail & Related papers (2021-05-28T17:09:45Z)
- OpinionRank: Extracting Ground Truth Labels from Unreliable Expert Opinions with Graph-Based Spectral Ranking [2.1930130356902207]
Crowdsourcing has emerged as a popular, inexpensive, and efficient data mining solution for performing distributed label collection.
We propose OpinionRank, a model-free, interpretable, graph-based spectral algorithm for integrating crowdsourced annotations into reliable labels.
Our experiments show that OpinionRank performs favorably when compared against more highly parameterized algorithms.
arXiv Detail & Related papers (2021-02-11T08:12:44Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- EvidentialMix: Learning with Combined Open-set and Closed-set Noisy Labels [30.268962418683955]
We study a new variant of the noisy label problem that combines the open-set and closed-set noisy labels.
Our results show that our method produces superior classification results and better feature representations than previous state-of-the-art methods.
arXiv Detail & Related papers (2020-11-11T11:15:32Z)
- Improving Face Recognition by Clustering Unlabeled Faces in the Wild [77.48677160252198]
We propose a novel identity separation method based on extreme value theory.
It greatly reduces the problems caused by overlapping-identity label noise.
Experiments on both controlled and real settings demonstrate our method's consistent improvements.
arXiv Detail & Related papers (2020-07-14T12:26:50Z)
- Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled Learning and Conditional Generation with Extra Data [77.31213472792088]
The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems.
We address this problem by leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data.
We present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data.
arXiv Detail & Related papers (2020-06-14T08:27:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.