k-Rater Reliability: The Correct Unit of Reliability for Aggregated
Human Annotations
- URL: http://arxiv.org/abs/2203.12913v1
- Date: Thu, 24 Mar 2022 08:05:06 GMT
- Title: k-Rater Reliability: The Correct Unit of Reliability for Aggregated
Human Annotations
- Authors: Ka Wong, Praveen Paritosh
- Abstract summary: A proposed k-rater reliability (kRR) should be used as the correct data reliability for aggregated datasets.
We present empirical, analytical, and bootstrap-based methods for computing kRR on WordSim-353.
- Score: 2.538209532048867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since the inception of crowdsourcing, aggregation has been a common strategy
for dealing with unreliable data. Aggregate ratings are more reliable than
individual ones. However, many natural language processing (NLP) applications
that rely on aggregate ratings only report the reliability of individual
ratings, which is the incorrect unit of analysis. In these instances, the data
reliability is under-reported, and a proposed k-rater reliability (kRR) should
be used as the correct data reliability for aggregated datasets. It is a
multi-rater generalization of inter-rater reliability (IRR). We conducted two
replications of the WordSim-353 benchmark, and present empirical, analytical,
and bootstrap-based methods for computing kRR on WordSim-353. These methods
produce very similar results. We hope this discussion will nudge researchers to
report kRR in addition to IRR.
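As a hedged illustration of the bootstrap-based approach the abstract mentions, the sketch below estimates kRR by repeatedly drawing two disjoint random groups of k raters, averaging each group's ratings per item, and correlating the two aggregate score vectors. The function name, the use of Pearson correlation, and the resampling scheme are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def bootstrap_krr(ratings, k, n_boot=500, seed=0):
    # ratings: (n_items, n_raters) matrix of individual ratings.
    # Each bootstrap round draws two disjoint random groups of k
    # raters, averages each group's ratings per item, and correlates
    # the two aggregate vectors; kRR is the mean correlation.
    rng = np.random.default_rng(seed)
    n_items, n_raters = ratings.shape
    assert 2 * k <= n_raters, "need at least 2k raters"
    corrs = []
    for _ in range(n_boot):
        idx = rng.permutation(n_raters)
        a = ratings[:, idx[:k]].mean(axis=1)
        b = ratings[:, idx[k:2 * k]].mean(axis=1)
        corrs.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(corrs))
```

On simulated data (true item scores plus independent per-rater noise), this estimate rises toward 1 as k grows, while k = 1 recovers an IRR-like single-rater correlation, matching the abstract's point that aggregate ratings are more reliable than individual ones.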
Related papers
- Retrieval-Augmented Generation with Estimation of Source Reliability [15.69681944254975]
Reliability-Aware RAG (RA-RAG) estimates the reliability of multiple sources and incorporates this information into both retrieval and aggregation processes.
We introduce a benchmark designed to reflect real-world scenarios with heterogeneous source reliability.
arXiv Detail & Related papers (2024-10-30T12:09:29Z)
- A Confidence-based Partial Label Learning Model for Crowd-Annotated
Named Entity Recognition [74.79785063365289]
Existing models for named entity recognition (NER) are mainly based on large-scale labeled datasets.
We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.
arXiv Detail & Related papers (2023-05-21T15:31:23Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- Towards Realistic Low-resource Relation Extraction: A Benchmark with
Empirical Baseline Study [51.33182775762785]
This paper presents an empirical study to build relation extraction systems in low-resource settings.
We investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; and (iii) data augmentation technologies and self-training to generate more labeled in-domain data.
arXiv Detail & Related papers (2022-10-19T15:46:37Z)
- FedRAD: Federated Robust Adaptive Distillation [7.775374800382709]
Collaborative learning frameworks that aggregate model updates are vulnerable to model poisoning attacks from adversarial clients.
We propose a novel robust aggregation method, Federated Robust Adaptive Distillation (FedRAD), to detect adversaries and robustly aggregate local models.
The results show that FedRAD outperforms all other aggregators in the presence of adversaries, as well as in heterogeneous data distributions.
arXiv Detail & Related papers (2021-12-02T16:50:57Z)
- Distributionally Robust Multi-Output Regression Ranking [3.9318191265352196]
We introduce a new listwise learning-to-rank model called Distributionally Robust Multi-output Regression Ranking (DRMRR).
DRMRR uses a Distributionally Robust Optimization framework to minimize a multi-output loss function under the most adverse distributions in the neighborhood of the empirical data distribution.
Our experiments were conducted on two real-world applications: medical document retrieval and drug response prediction.
arXiv Detail & Related papers (2021-09-27T05:19:27Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual
Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- RIFLE: Imputation and Robust Inference from Low Order Marginals [10.082738539201804]
We develop a statistical inference framework for regression and classification in the presence of missing data without imputation.
Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model.
Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
arXiv Detail & Related papers (2021-09-01T23:17:30Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
- Cross-replication Reliability -- An Empirical Approach to Interpreting
Inter-rater Reliability [2.2091544233596596]
We present a new approach to interpreting IRR that is empirical and contextualized.
It is based upon benchmarking IRR against baseline measures in a replication, one of which is a novel cross-replication reliability (xRR) measure based on Cohen's kappa.
arXiv Detail & Related papers (2021-06-11T16:15:46Z)
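Since xRR is described as being based on Cohen's kappa, a minimal sketch of plain two-rater Cohen's kappa may help fix ideas. This is the standard textbook computation (observed agreement corrected for chance agreement), not the paper's xRR formula itself.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    # Cohen's kappa for two raters labeling the same items:
    # observed agreement p_o minus chance agreement p_e, normalized
    # by the maximum possible improvement over chance (1 - p_e).
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, perfect agreement yields kappa = 1, while agreement no better than chance yields kappa = 0; the xRR measure applies a kappa-style correction across two replications of an annotation task.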
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.